Re: [RFC PATCH v1] vsock: add mergeable rx buffer description
On 2021/6/11 2:39 AM, Jiang Wang wrote:

Mergeable rx buffer is already supported by virtio-net, and it can save memory for big packets. It will also be beneficial for the vsock devices, so add it to the spec.

A lot of duplication with the virtio-net mergeable rx buffer description. I wonder whether we can have a generic feature like VIRTIO_F_MRG_BUFFER instead.

---
V0 -> V1: I sent a similar patch with vsock dgram before and already got some comments. This version fixed those, such as using present tense for the feature bit, etc. Also, the feature bit value is 3, because we expect to have some other feature bits defined soon.

 virtio-vsock.tex | 78 +++-
 1 file changed, 77 insertions(+), 1 deletion(-)

diff --git a/virtio-vsock.tex b/virtio-vsock.tex
index da7e641..d529291 100644
--- a/virtio-vsock.tex
+++ b/virtio-vsock.tex
@@ -16,7 +16,9 @@ \subsection{Virtqueues}\label{sec:Device Types / Socket Device / Virtqueues}
 \subsection{Feature bits}\label{sec:Device Types / Socket Device / Feature bits}
-There are currently no feature bits defined for this device.
+\begin{description}
+\item[VIRTIO_VSOCK_F_MRG_RXBUF (3)] Driver can merge receive buffers.
+\end{description}
 \subsection{Device configuration layout}\label{sec:Device Types / Socket Device / Device configuration layout}
@@ -64,6 +66,8 @@ \subsection{Device Operation}\label{sec:Device Types / Socket Device / Device Op
 Packets transmitted or received contain a header before the payload:
+If feature VIRTIO_VSOCK_F_MRG_RXBUF is not negotiated, use the following header.
+
 \begin{lstlisting}
 struct virtio_vsock_hdr {
 	le64 src_cid;
@@ -79,6 +83,15 @@ \subsection{Device Operation}\label{sec:Device Types / Socket Device / Device Op
 };
 \end{lstlisting}
+If feature VIRTIO_VSOCK_F_MRG_RXBUF is negotiated, use the following header.
+\begin{lstlisting}
+struct virtio_vsock_hdr_mrg_rxbuf {
+	struct virtio_vsock_hdr hdr;
+	le16 num_buffers;
+};
+\end{lstlisting}
+
+
 The upper 32 bits of src_cid and dst_cid are reserved and zeroed.
Most packets simply transfer data but control packets are also used for
@@ -170,6 +183,23 @@ \subsubsection{Buffer Space Management}\label{sec:Device Types / Socket Device /
 previously receiving a VIRTIO_VSOCK_OP_CREDIT_REQUEST packet. This allows
 communicating updates any time a change in buffer space occurs.
+\drivernormative{\paragraph}{Device Operation: Buffer Space Management}{Device Types / Socket Device / Device Operation / Setting Up Receive Buffers}
+\begin{itemize}
+\item If VIRTIO_VSOCK_F_MRG_RXBUF is negotiated, each buffer MUST be at
+least the size of the struct virtio_vsock_hdr_mrg_rxbuf.
+\end{itemize}
+
+\begin{note}
+Obviously each buffer can be split across multiple descriptor elements.
+\end{note}
+
+\devicenormative{\paragraph}{Device Operation: Buffer Space Management}{Device Types / Socket Device / Device Operation / Setting Up Receive Buffers}
+The device MUST set \field{num_buffers} to the number of descriptors used when
+transmitting the packet.

Interesting, does this mean it works for tx? Virtio-net had: "The driver MUST set num_buffers to zero."

+
+The device MUST use only a single descriptor if VIRTIO_VSOCK_F_MRG_RXBUF
+is not negotiated.

Though virtio-net uses something similar, I think we need to clarify: what we really need is "a single buffer", not "a single descriptor".

+
 \drivernormative{\paragraph}{Device Operation: Buffer Space Management}{Device Types / Socket Device / Device Operation / Buffer Space Management}
 VIRTIO_VSOCK_OP_RW data packets MUST only be transmitted when the peer has
 sufficient free buffer space for the payload.
@@ -188,6 +218,7 @@ \subsubsection{Receive and Transmit}\label{sec:Device Types / Socket Device / De
 The driver queues outgoing packets on the tx virtqueue and incoming
 packet receive buffers on the rx virtqueue. Packets are of the following form:
+If VIRTIO_VSOCK_F_MRG_RXBUF is not negotiated, use the following.
\begin{lstlisting}
struct virtio_vsock_packet {
	struct virtio_vsock_hdr hdr;
@@ -195,9 +226,41 @@ \subsubsection{Receive and Transmit}\label{sec:Device Types / Socket Device / De
 };
 \end{lstlisting}
+Otherwise, use the following form:
+\begin{lstlisting}
+struct virtio_vsock_packet_mrg_rxbuf {
+	struct virtio_vsock_hdr_mrg_rxbuf hdr;
+	u8 data[];
+};
+\end{lstlisting}
+
 Virtqueue buffers for outgoing packets are read-only. Virtqueue buffers
 for incoming packets are write-only.
+When transmitting packets to the device, \field{num_buffers} is not used.
+
+\begin{enumerate}
+\item \field{num_buffers} indicates how many descriptors

Similarly, I think what we want here is "buffers" instead of "descriptors". Thanks.

+	this packet is spread over (including this one).
+	This is valid only if VIRTIO_VSOCK_F_MRG_RXBUF is negotiated.
+	This allows receipt of large packets
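The reassembly rule being discussed (only the first buffer of a packet carries the header; the payload spans `num_buffers` used buffers) can be modelled in a few lines of self-contained C. This is a sketch with hypothetical helper names, not spec or driver code; the 46-byte header size is derived from the `struct virtio_vsock_hdr` fields quoted above plus the `le16 num_buffers` field.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of VIRTIO_VSOCK_F_MRG_RXBUF reassembly: only the
 * first buffer begins with struct virtio_vsock_hdr_mrg_rxbuf, and one
 * packet's payload spans num_buffers used buffers. */
#define VSOCK_HDR_MRG_LEN (44u + 2u) /* virtio_vsock_hdr (44 bytes) + le16 num_buffers */

/* Returns the total payload length of one packet, or -1 if fewer used
 * buffers are available than num_buffers claims. */
static long mrg_payload_len(const uint32_t *used_len, size_t nused,
                            uint16_t num_buffers)
{
    long total = 0;

    if (num_buffers == 0 || nused < num_buffers)
        return -1;

    for (uint16_t i = 0; i < num_buffers; i++)
        total += used_len[i];

    /* Only the first buffer carries the header; the rest are raw payload. */
    return total - (long)VSOCK_HDR_MRG_LEN;
}
```

This mirrors why the normative text needs each rx buffer to be at least the merged header size: the subtraction only makes sense if the first buffer can hold the whole header.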
Re: [PATCH V3 1/3] gpio: Add virtio-gpio driver
Hi Linus,

On 10-06-21, 22:46, Linus Walleij wrote:
> Hi Viresh!
>
> thanks for working on this, it's a really interesting driver.
>
> My first question is conceptual:
>
> We previously have Geert's driver for virtualization:
> drivers/gpio/gpio-aggregator.c
>
> The idea with the aggregator is that a host script sets up a
> unique gpiochip for the virtualized instance using some poking
> in sysfs and pass that to the virtual machine.
> So this is Linux acting as virtualization host by definition.
>
> I think virtio is more abstract and intended for the usecase
> where the hypervisor is not Linux, so this should be mentioned
> in the commit, possibly also in Kconfig so users immediately
> know what usecases the two different drivers are for.

Well, not actually. The host can actually be anything. It can be a Xen based dom0, which runs some proprietary firmware, or Qemu running over Linux. It is left for the host to decide how it wants to club together the GPIO pins from the host and access them; with a Linux host userspace it would be playing with /dev/gpiochipN, while for a raw one it may be accessing registers directly. And so the backend running at the host needs to pass the gpiochip configurations, and only the host understands them.

The way I test it for now is by running this with Qemu over my x86 box, so my host side is indeed playing with Linux sysfs.

> Possibly both could be used: aggregator to pick out the GPIOs
> you want into a synthetic GPIO chip, and the actual talk
> between the hypervisor/host and the guest using virtio, even
> with linux-on-linux.

Not sure if I understand the aggregator thing for now, but we see the backend running at the host (which talks to this Linux driver at the guest) as a userspace thing and not a kernel driver. Not sure if the aggregator can be used like that, but anyway..
> Yet another usecase would be to jit this with remoteproc/rpmsg
> and let a specific signal processor or real-time executive on
> another CPU with a few GPIOs around present these to
> Linux using this mechanism. Well that would certainly interest
> Bjorn and other rpmsg stakeholders, so they should have
> a look so that this provides what they need the day they
> need it. (CCed Bjorn and also Google who may want this for
> their Android emulators.)

I am not very clear on the rpmsg thing; I know a couple of folks at project Stratos were talking about it :)

@Alex, want to chime in here for me ? :)

> On Thu, Jun 10, 2021 at 2:16 PM Viresh Kumar wrote:
>
> > +static const char **parse_gpio_names(struct virtio_device *vdev,
> > +                                     struct virtio_gpio_config *config)
>
> I really like this end-to-end plug-and-play that even provides
> the names over virtio.

The credit goes to Enrico for this :)

> I think my patch to the gpiolib to make it mandatory for names to
> be unique per-chip made it in, but please make that part of the spec
> so that we don't get the problem with non-unique names here.

Oh, that's nice. I will surely do that.

> I suppose the spec can be augmented later to also accept config
> settings like open drain pull up/down etc but no need to specify
> more than the basic for now.

That's the plan.

> But to be able to add more in the future, the client needs some
> kind of query mechanism or version number so the driver can
> adapt and not announce something the underlying virtio device
> cannot do. Do we have this? A bitmask for features, a version
> number that increase monotonically for new features to be
> presented or similar?
>
> Because otherwise we have to bump this:
> +#define VIRTIO_ID_GPIO 41 /* virtio GPIO */
>
> every time we add something new (and we will).

Yes, Virtio presents features for this. Patch 2/3 already uses one for IRQs.
We won't need to bump up the IDs :) -- viresh ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH V3 1/3] gpio: Add virtio-gpio driver
On 10-06-21, 19:40, Jean-Philippe Brucker wrote:
> On Thu, Jun 10, 2021 at 12:16:46PM +, Viresh Kumar via Stratos-dev wrote:

Fixed everything else you suggested.

> > +struct virtio_gpio_config {
> > + char name[32];
> > + __u16 ngpio;
> > + __u16 padding;
> > + __u32 gpio_names_size;
> > + char gpio_names[0];
>
> A variable-size array here will make it very difficult to append new
> fields to virtio_gpio_config, when adding new features to the device. (New
> fields cannot be inserted in between, since older drivers will expect
> existing fields at a specific offset.)

Yes, I thought about that earlier and thought maybe we would be able to play with that using the virtio-features; I mean, a different layout of the config structure if we really need to add a field in config, based on the feature flag.

> You could replace it with a reference to the string array, for example
> "__u16 gpio_names_offset" declaring the offset between the beginning of
> device-specific config and the string array.

But, I like this idea more and it does make it very flexible. Will adapt to it.

> The 'name' field could also be indirect to avoid setting a fixed
> 32-char size, but that's not as important.

Yeah, 32 bytes is really enough. One won't be able to make any sense out of a bigger name anyway :)

> > +} __packed;
>
> No need for __packed, because the fields are naturally aligned (as
> required by the virtio spec)

Yeah, I know, but I tend to add that for structures which aren't very simple (like the request/response ones), just to avoid human errors and the hours of debugging someone would need to go through. __packed won't harm at least :)

-- 
viresh
Re: [PATCH V3 2/3] gpio: virtio: Add IRQ support
Hi Viresh!

thanks for this interesting patch!

On Thu, Jun 10, 2021 at 2:16 PM Viresh Kumar wrote:
> This patch adds IRQ support for the virtio GPIO driver. Note that this
> uses the irq_bus_lock/unlock() callbacks since the operations over
> virtio can sleep.
>
> Signed-off-by: Viresh Kumar
> drivers/gpio/gpio-virtio.c | 256 ++-
> include/uapi/linux/virtio_gpio.h | 15 ++

You also need to select GPIOLIB_IRQCHIP in the Kconfig entry for the gpio-virtio driver, because you make use of it.

> +static void virtio_gpio_irq_mask(struct irq_data *d)
> +{
> + struct gpio_chip *gc = irq_data_to_gpio_chip(d);
> + struct virtio_gpio *vgpio = gpio_chip_to_vgpio(gc);
> + struct vgpio_line *line = &vgpio->lines[d->hwirq];
> +
> + line->masked = true;
> + line->masked_pending = true;
> +}
> +
> +static void virtio_gpio_irq_unmask(struct irq_data *d)
> +{
> + struct gpio_chip *gc = irq_data_to_gpio_chip(d);
> + struct virtio_gpio *vgpio = gpio_chip_to_vgpio(gc);
> + struct vgpio_line *line = &vgpio->lines[d->hwirq];
> +
> + line->masked = false;
> + line->masked_pending = true;
> +}

This looks dangerous in combination with this:

> +static void virtio_gpio_interrupt(struct virtqueue *vq)
> +{
(...)
> + local_irq_disable();
> + ret = generic_handle_irq(irq);
> + local_irq_enable();

Nominally slow IRQs like those being marshalled over virtio should be nested, i.e. handle_nested_irq(irq); but are they? Or are they just quite slow, not super slow?

If a threaded IRQF_ONESHOT was requested, the IRQ core will kick the thread and *MASK* this IRQ, which means it will call back to your .irq_mask() function and expect it to be masked from this point. But the IRQ will not actually be masked until you issue your callbacks in the .irq_bus_sync_unlock() callback, right? So from this point until .irq_bus_sync_unlock() gets called and actually masks the IRQ, it could be fired again? I suppose the IRQ handler is reentrant? This would violate the API.
I would say that from this point and until you sync you need a spinlock or other locking primitive to stop this IRQ from firing again, and a spinlock will imply local_irq_disable(), so this gets really complex.

I would say only use nesting IRQs or guarantee this some other way; one way would be to specify that whatever is at the other side of virtio cannot send another GPIO IRQ message before the last one is handled, so you would need to send a specific (new) VIRTIO_GPIO_REQ_IRQ_ACK after all other messages have been sent in .irq_bus_sync_unlock() so that the next GPIO IRQ can be dispatched after that.

(Is this how message signalled interrupts work? No idea. When in doubt ask the IRQ maintainers.)

Thanks,
Linus Walleij
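The window Linus describes, where .irq_mask() only records the request and the device-visible mask changes when .irq_bus_sync_unlock() flushes it over virtio, can be illustrated with a toy state model. This is purely illustrative C with made-up names, not kernel code, and it models the timing gap rather than any real driver.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the deferred-mask window: the IRQ core's view of the mask
 * and the device's view diverge until the bus-sync flush happens. */
struct vgpio_line_model {
    bool masked;          /* what the IRQ core believes after .irq_mask() */
    bool masked_pending;  /* change recorded but not yet sent over virtio */
    bool device_masked;   /* what the device actually enforces */
};

static void model_irq_mask(struct vgpio_line_model *l)
{
    l->masked = true;
    l->masked_pending = true;   /* deferred: no virtio message yet */
}

static void model_bus_sync_unlock(struct vgpio_line_model *l)
{
    if (l->masked_pending) {
        l->device_masked = l->masked;  /* the virtio request goes out here */
        l->masked_pending = false;
    }
}

/* The device can still deliver an interrupt while device_masked is false,
 * which is exactly the reentrancy hazard raised in the review. */
static bool model_irq_can_fire(const struct vgpio_line_model *l)
{
    return !l->device_masked;
}
```

Running the three steps in order shows the hazard: after model_irq_mask() the core believes the line is masked, yet model_irq_can_fire() still returns true until the sync runs.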
Re: [PATCH V3 1/3] gpio: Add virtio-gpio driver
Hi Viresh!

thanks for working on this, it's a really interesting driver.

My first question is conceptual:

We previously have Geert's driver for virtualization: drivers/gpio/gpio-aggregator.c

The idea with the aggregator is that a host script sets up a unique gpiochip for the virtualized instance using some poking in sysfs and passes that to the virtual machine. So this is Linux acting as virtualization host by definition.

I think virtio is more abstract and intended for the usecase where the hypervisor is not Linux, so this should be mentioned in the commit, possibly also in Kconfig, so users immediately know what usecases the two different drivers are for.

Possibly both could be used: aggregator to pick out the GPIOs you want into a synthetic GPIO chip, and the actual talk between the hypervisor/host and the guest using virtio, even with linux-on-linux.

Yet another usecase would be to jit this with remoteproc/rpmsg and let a specific signal processor or real-time executive on another CPU with a few GPIOs around present these to Linux using this mechanism. Well, that would certainly interest Bjorn and other rpmsg stakeholders, so they should have a look so that this provides what they need the day they need it. (CCed Bjorn and also Google who may want this for their Android emulators.)

On Thu, Jun 10, 2021 at 2:16 PM Viresh Kumar wrote:
> +static const char **parse_gpio_names(struct virtio_device *vdev,
> +                                     struct virtio_gpio_config *config)

I really like this end-to-end plug-and-play that even provides the names over virtio. I think my patch to the gpiolib to make it mandatory for names to be unique per-chip made it in, but please make that part of the spec so that we don't get the problem with non-unique names here.

I suppose the spec can be augmented later to also accept config settings like open drain, pull up/down, etc., but no need to specify more than the basics for now.
But to be able to add more in the future, the client needs some kind of query mechanism or version number so the driver can adapt and not announce something the underlying virtio device cannot do. Do we have this? A bitmask for features, a version number that increases monotonically for new features to be presented, or similar?

Because otherwise we have to bump this:
+#define VIRTIO_ID_GPIO 41 /* virtio GPIO */
every time we add something new (and we will).

Yours,
Linus Walleij
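The answer in the thread is that virtio already has exactly this: a feature bitmask negotiated between device and driver. The mechanism can be sketched as a simple bitmask intersection; the feature-bit name below is hypothetical, purely to illustrate why new capabilities do not require bumping the device ID.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative model of virtio feature negotiation (made-up bit number,
 * not from the virtio-gpio spec): the driver only operates with features
 * that both the device offers and the driver understands. */
#define VIRTIO_GPIO_F_IRQ_MODEL (UINT64_C(1) << 0)   /* hypothetical */

static uint64_t negotiate_features(uint64_t device_features,
                                   uint64_t driver_features)
{
    /* Anything the device offers but the driver doesn't know is dropped,
     * and vice versa, so old drivers keep working with new devices. */
    return device_features & driver_features;
}

static int feature_negotiated(uint64_t negotiated, uint64_t bit)
{
    return (negotiated & bit) == bit;
}
```

This is why adding, say, IRQ support is a new feature bit rather than a new VIRTIO_ID: an old driver simply never acknowledges the bit.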
RE: [PATCH 02/19] asm-generic/hyperv: convert hyperv statuses to strings
> -Original Message- > From: Nuno Das Neves > Sent: Friday, May 28, 2021 3:43 PM > To: linux-hyp...@vger.kernel.org; linux-ker...@vger.kernel.org > Cc: virtualization@lists.linux-foundation.org; Michael Kelley > ; virem...@linux.microsoft.com; Sunil > Muthuswamy ; wei@kernel.org; vkuznets > ; Lillian Grassin-Drake > ; KY Srinivasan > Subject: [PATCH 02/19] asm-generic/hyperv: convert hyperv statuses to strings > > Allow hyperv hypercall failures to be debugged more easily with dmesg. > This will be used in the mshv module. > > Signed-off-by: Nuno Das Neves > --- > arch/x86/hyperv/hv_init.c | 2 +- > arch/x86/hyperv/hv_proc.c | 10 +++--- > include/asm-generic/hyperv-tlfs.h | 52 ++- > include/asm-generic/mshyperv.h| 8 + > 4 files changed, 44 insertions(+), 28 deletions(-) > > diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c > index bb0ae4b5c00f..722bafdb2225 100644 > --- a/arch/x86/hyperv/hv_init.c > +++ b/arch/x86/hyperv/hv_init.c > @@ -349,7 +349,7 @@ static void __init hv_get_partition_id(void) > status = hv_do_hypercall(HVCALL_GET_PARTITION_ID, NULL, output_page); > if (!hv_result_success(status)) { > /* No point in proceeding if this failed */ > - pr_err("Failed to get partition ID: %lld\n", status); > + pr_err("Failed to get partition ID: %s\n", > hv_status_to_string(status)); > BUG(); > } > hv_current_partition_id = output_page->partition_id; > diff --git a/arch/x86/hyperv/hv_proc.c b/arch/x86/hyperv/hv_proc.c > index 59cf9a9e0975..30951e778577 100644 > --- a/arch/x86/hyperv/hv_proc.c > +++ b/arch/x86/hyperv/hv_proc.c > @@ -117,7 +117,7 @@ int hv_call_deposit_pages(int node, u64 partition_id, u32 > num_pages) >page_count, 0, input_page, NULL); > local_irq_restore(flags); > if (!hv_result_success(status)) { > - pr_err("Failed to deposit pages: %lld\n", status); > + pr_err("Failed to deposit pages: %s\n", > hv_status_to_string(status)); > ret = hv_status_to_errno(status); > goto err_free_allocations; > } > @@ -172,8 +172,8 @@ int 
hv_call_add_logical_proc(int node, u32 lp_index, u32 > apic_id) > > if (hv_result(status) != HV_STATUS_INSUFFICIENT_MEMORY) { > if (!hv_result_success(status)) { > - pr_err("%s: cpu %u apic ID %u, %lld\n", > __func__, > -lp_index, apic_id, status); > + pr_err("%s: cpu %u apic ID %u, %s\n", __func__, > +lp_index, apic_id, > hv_status_to_string(status)); > ret = hv_status_to_errno(status); > } > break; > @@ -222,8 +222,8 @@ int hv_call_create_vp(int node, u64 partition_id, u32 > vp_index, u32 flags) > > if (hv_result(status) != HV_STATUS_INSUFFICIENT_MEMORY) { > if (!hv_result_success(status)) { > - pr_err("%s: vcpu %u, lp %u, %lld\n", __func__, > -vp_index, flags, status); > + pr_err("%s: vcpu %u, lp %u, %s\n", __func__, > +vp_index, flags, > hv_status_to_string(status)); > ret = hv_status_to_errno(status); > } > break; > diff --git a/include/asm-generic/hyperv-tlfs.h > b/include/asm-generic/hyperv-tlfs.h > index fe6d41d0b114..40ff7cdd4a2b 100644 > --- a/include/asm-generic/hyperv-tlfs.h > +++ b/include/asm-generic/hyperv-tlfs.h > @@ -189,28 +189,36 @@ enum HV_GENERIC_SET_FORMAT { > #define HV_HYPERCALL_REP_START_MASK GENMASK_ULL(59, 48) > > /* hypercall status code */ > -#define HV_STATUS_SUCCESS0x0 > -#define HV_STATUS_INVALID_HYPERCALL_CODE 0x2 > -#define HV_STATUS_INVALID_HYPERCALL_INPUT0x3 > -#define HV_STATUS_INVALID_ALIGNMENT 0x4 > -#define HV_STATUS_INVALID_PARAMETER 0x5 > -#define HV_STATUS_ACCESS_DENIED 0x6 > -#define HV_STATUS_INVALID_PARTITION_STATE0x7 > -#define HV_STATUS_OPERATION_DENIED 0x8 > -#define HV_STATUS_UNKNOWN_PROPERTY 0x9 > -#define HV_STATUS_PROPERTY_VALUE_OUT_OF_RANGE0xA > -#define HV_STATUS_INSUFFICIENT_MEMORY0xB > -#define HV_STATUS_INVALID_PARTITION_ID 0xD > -#define HV_STATUS_INVALID_VP_INDEX 0xE > -#define HV_STATUS_NOT_FOUND 0x10 > -#define HV_STATUS_INVALID_PORT_ID0x11 > -#define HV_STATUS_INVALID_CONNECTION_ID 0x12 > -#define HV_STATUS_INSUFFICIENT_BUFFERS 0x13 > -#define HV_STATUS_NOT_ACKNOWLEDGED 0x14 > -#define 
HV_STATUS_INVALID_VP_STATE 0x15 > -#define HV_STATUS_NO_RESOURC
[RFC PATCH v1] vsock: add mergeable rx buffer description
Mergeable rx buffer is already supported by virtio-net, and it can save memory for big packets. It will also be beneficial for the vsock devices, so add it to the spec.

---
V0 -> V1: I sent a similar patch with vsock dgram before and already got some comments. This version fixed those, such as using present tense for the feature bit, etc. Also, the feature bit value is 3, because we expect to have some other feature bits defined soon.

 virtio-vsock.tex | 78 +++-
 1 file changed, 77 insertions(+), 1 deletion(-)

diff --git a/virtio-vsock.tex b/virtio-vsock.tex
index da7e641..d529291 100644
--- a/virtio-vsock.tex
+++ b/virtio-vsock.tex
@@ -16,7 +16,9 @@ \subsection{Virtqueues}\label{sec:Device Types / Socket Device / Virtqueues}
 \subsection{Feature bits}\label{sec:Device Types / Socket Device / Feature bits}
-There are currently no feature bits defined for this device.
+\begin{description}
+\item[VIRTIO_VSOCK_F_MRG_RXBUF (3)] Driver can merge receive buffers.
+\end{description}
 \subsection{Device configuration layout}\label{sec:Device Types / Socket Device / Device configuration layout}
@@ -64,6 +66,8 @@ \subsection{Device Operation}\label{sec:Device Types / Socket Device / Device Op
 Packets transmitted or received contain a header before the payload:
+If feature VIRTIO_VSOCK_F_MRG_RXBUF is not negotiated, use the following header.
+
 \begin{lstlisting}
 struct virtio_vsock_hdr {
 	le64 src_cid;
@@ -79,6 +83,15 @@ \subsection{Device Operation}\label{sec:Device Types / Socket Device / Device Op
 };
 \end{lstlisting}
+If feature VIRTIO_VSOCK_F_MRG_RXBUF is negotiated, use the following header.
+\begin{lstlisting}
+struct virtio_vsock_hdr_mrg_rxbuf {
+	struct virtio_vsock_hdr hdr;
+	le16 num_buffers;
+};
+\end{lstlisting}
+
+
 The upper 32 bits of src_cid and dst_cid are reserved and zeroed.

Most packets simply transfer data but control packets are also used for
@@ -170,6 +183,23 @@ \subsubsection{Buffer Space Management}\label{sec:Device Types / Socket Device /
 previously receiving a VIRTIO_VSOCK_OP_CREDIT_REQUEST packet. This allows
 communicating updates any time a change in buffer space occurs.
+\drivernormative{\paragraph}{Device Operation: Buffer Space Management}{Device Types / Socket Device / Device Operation / Setting Up Receive Buffers}
+\begin{itemize}
+\item If VIRTIO_VSOCK_F_MRG_RXBUF is negotiated, each buffer MUST be at
+least the size of the struct virtio_vsock_hdr_mrg_rxbuf.
+\end{itemize}
+
+\begin{note}
+Obviously each buffer can be split across multiple descriptor elements.
+\end{note}
+
+\devicenormative{\paragraph}{Device Operation: Buffer Space Management}{Device Types / Socket Device / Device Operation / Setting Up Receive Buffers}
+The device MUST set \field{num_buffers} to the number of descriptors used when
+transmitting the packet.
+
+The device MUST use only a single descriptor if VIRTIO_VSOCK_F_MRG_RXBUF
+is not negotiated.
+
 \drivernormative{\paragraph}{Device Operation: Buffer Space Management}{Device Types / Socket Device / Device Operation / Buffer Space Management}
 VIRTIO_VSOCK_OP_RW data packets MUST only be transmitted when the peer has
 sufficient free buffer space for the payload.
@@ -188,6 +218,7 @@ \subsubsection{Receive and Transmit}\label{sec:Device Types / Socket Device / De
 The driver queues outgoing packets on the tx virtqueue and incoming
 packet receive buffers on the rx virtqueue. Packets are of the following form:
+If VIRTIO_VSOCK_F_MRG_RXBUF is not negotiated, use the following.

\begin{lstlisting}
struct virtio_vsock_packet {
	struct virtio_vsock_hdr hdr;
@@ -195,9 +226,41 @@ \subsubsection{Receive and Transmit}\label{sec:Device Types / Socket Device / De
 };
 \end{lstlisting}
+Otherwise, use the following form:
+\begin{lstlisting}
+struct virtio_vsock_packet_mrg_rxbuf {
+	struct virtio_vsock_hdr_mrg_rxbuf hdr;
+	u8 data[];
+};
+\end{lstlisting}
+
 Virtqueue buffers for outgoing packets are read-only. Virtqueue buffers
 for incoming packets are write-only.
+When transmitting packets to the device, \field{num_buffers} is not used.
+
+\begin{enumerate}
+\item \field{num_buffers} indicates how many descriptors
+	this packet is spread over (including this one).
+	This is valid only if VIRTIO_VSOCK_F_MRG_RXBUF is negotiated.
+	This allows receipt of large packets without having to allocate large
+	buffers: a packet that does not fit in a single buffer can flow
+	over to the next buffer, and so on. In this case, there will be
+	at least \field{num_buffers} used buffers in the virtqueue, and the device
+	chains them together to form a single packet in a way similar to
+	how it would store it in a single buffer spread over multiple
+	descriptors.
+	The other buffers will not begin with a struct virtio_vsock_hdr.
+
+	If VIRTIO_VSOCK_F_MRG_RXBUF is not negotiated, then only one
+	descriptor is used.
+
+\item If
+	\field{num_buffers} is one,
RE: [PATCH 01/19] x86/hyperv: convert hyperv statuses to linux error codes
> -Original Message- > From: Nuno Das Neves > Sent: Friday, May 28, 2021 3:43 PM > To: linux-hyp...@vger.kernel.org; linux-ker...@vger.kernel.org > Cc: virtualization@lists.linux-foundation.org; Michael Kelley > ; virem...@linux.microsoft.com; Sunil > Muthuswamy ; wei@kernel.org; vkuznets > ; Lillian Grassin-Drake > ; KY Srinivasan > Subject: [PATCH 01/19] x86/hyperv: convert hyperv statuses to linux error > codes > > Return linux-friendly error codes from hypercall wrapper functions. > This will be needed in the mshv module. > > Signed-off-by: Nuno Das Neves > --- > arch/x86/hyperv/hv_proc.c | 30 ++--- > arch/x86/include/asm/mshyperv.h | 1 + > include/asm-generic/hyperv-tlfs.h | 32 +-- > 3 files changed, 50 insertions(+), 13 deletions(-) > > diff --git a/arch/x86/hyperv/hv_proc.c b/arch/x86/hyperv/hv_proc.c > index 68a0843d4750..59cf9a9e0975 100644 > --- a/arch/x86/hyperv/hv_proc.c > +++ b/arch/x86/hyperv/hv_proc.c > @@ -14,6 +14,30 @@ > > #include > > +int hv_status_to_errno(u64 hv_status) > +{ > + switch (hv_result(hv_status)) { > + case HV_STATUS_SUCCESS: > + return 0; > + case HV_STATUS_INVALID_PARAMETER: > + case HV_STATUS_UNKNOWN_PROPERTY: > + case HV_STATUS_PROPERTY_VALUE_OUT_OF_RANGE: > + case HV_STATUS_INVALID_VP_INDEX: > + case HV_STATUS_INVALID_REGISTER_VALUE: > + case HV_STATUS_INVALID_LP_INDEX: > + return -EINVAL; > + case HV_STATUS_ACCESS_DENIED: > + case HV_STATUS_OPERATION_DENIED: > + return -EACCES; > + case HV_STATUS_NOT_ACKNOWLEDGED: > + case HV_STATUS_INVALID_VP_STATE: > + case HV_STATUS_INVALID_PARTITION_STATE: > + return -EBADFD; > + } > + return -ENOTRECOVERABLE; > +} > +EXPORT_SYMBOL_GPL(hv_status_to_errno); > + > /* > * See struct hv_deposit_memory. The first u64 is partition ID, the rest > * are GPAs. 
> @@ -94,7 +118,7 @@ int hv_call_deposit_pages(int node, u64 partition_id, u32 > num_pages) > local_irq_restore(flags); > if (!hv_result_success(status)) { > pr_err("Failed to deposit pages: %lld\n", status); > - ret = hv_result(status); > + ret = hv_status_to_errno(status); > goto err_free_allocations; > } > > @@ -150,7 +174,7 @@ int hv_call_add_logical_proc(int node, u32 lp_index, u32 > apic_id) > if (!hv_result_success(status)) { > pr_err("%s: cpu %u apic ID %u, %lld\n", > __func__, > lp_index, apic_id, status); > - ret = hv_result(status); > + ret = hv_status_to_errno(status); > } > break; > } > @@ -200,7 +224,7 @@ int hv_call_create_vp(int node, u64 partition_id, u32 > vp_index, u32 flags) > if (!hv_result_success(status)) { > pr_err("%s: vcpu %u, lp %u, %lld\n", __func__, > vp_index, flags, status); > - ret = hv_result(status); > + ret = hv_status_to_errno(status); > } > break; > } > diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h > index 67ff0d637e55..c6eb01f3864d 100644 > --- a/arch/x86/include/asm/mshyperv.h > +++ b/arch/x86/include/asm/mshyperv.h > @@ -169,6 +169,7 @@ int hyperv_flush_guest_mapping_range(u64 as, > int hyperv_fill_flush_guest_mapping_list( > struct hv_guest_mapping_flush_list *flush, > u64 start_gfn, u64 end_gfn); > +int hv_status_to_errno(u64 hv_status); > > extern bool hv_root_partition; > > diff --git a/include/asm-generic/hyperv-tlfs.h > b/include/asm-generic/hyperv-tlfs.h > index 515c3fb06ab3..fe6d41d0b114 100644 > --- a/include/asm-generic/hyperv-tlfs.h > +++ b/include/asm-generic/hyperv-tlfs.h > @@ -189,16 +189,28 @@ enum HV_GENERIC_SET_FORMAT { > #define HV_HYPERCALL_REP_START_MASK GENMASK_ULL(59, 48) > > /* hypercall status code */ > -#define HV_STATUS_SUCCESS0 > -#define HV_STATUS_INVALID_HYPERCALL_CODE 2 > -#define HV_STATUS_INVALID_HYPERCALL_INPUT3 > -#define HV_STATUS_INVALID_ALIGNMENT 4 > -#define HV_STATUS_INVALID_PARAMETER 5 > -#define HV_STATUS_OPERATION_DENIED 8 > -#define 
HV_STATUS_INSUFFICIENT_MEMORY11 > -#define HV_STATUS_INVALID_PORT_ID17 > -#define HV_STATUS_INVALID_CONNECTION_ID 18 > -#define HV_STATUS_INSUFFICIENT_BUFFERS 19 > +#define HV_STATUS_SUCCESS0x0 > +#define HV_STATUS_INVALID_HYPERCALL_CODE 0x2 > +#define HV_STATUS_INVALID_HYPERCALL_INPUT0x3 > +#define HV_STATUS_INVALID_ALIGNMENT 0x4 > +#define HV_STATUS_INVALID_PARAMETER 0x5 > +#
[PATCH v5] virtio-vsock: add description for datagram type
Add supports for datagram type for virtio-vsock. Datagram sockets are connectionless and unreliable. To avoid contention with stream and other sockets, add two more virtqueues and a new feature bit to identify if those two new queues exist or not. Also add descriptions for resource management of datagram, which does not use the existing credit update mechanism associated with stream sockets. Signed-off-by: Jiang Wang --- V2: addressed the comments for the previous version. V3: add description for the mergeable receive buffer. V4: add a feature bit for stream and reserver a bit for seqpacket. Fix mrg_rxbuf related sentences. V5: removed mergeable rx buffer part. It will go to a separate patch. Fixed comments about tx, rx, feature bit etc. virtio-vsock.tex | 71 +++- 1 file changed, 60 insertions(+), 11 deletions(-) diff --git a/virtio-vsock.tex b/virtio-vsock.tex index da7e641..26a62ac 100644 --- a/virtio-vsock.tex +++ b/virtio-vsock.tex @@ -9,14 +9,37 @@ \subsection{Device ID}\label{sec:Device Types / Socket Device / Device ID} \subsection{Virtqueues}\label{sec:Device Types / Socket Device / Virtqueues} \begin{description} -\item[0] rx -\item[1] tx +\item[0] stream rx +\item[1] stream tx +\item[2] datagram rx +\item[3] datagram tx +\item[4] event +\end{description} +The virtio socket device uses 5 queues if feature bit VIRTIO_VSOCK_F_DRGAM is set. Otherwise, it +only uses 3 queues, as the following. + +\begin{description} +\item[0] stream rx +\item[1] stream tx \item[2] event \end{description} +When behavior differs between stream and datagram rx/tx virtqueues +their full names are used. Common behavior is simply described in +terms of rx/tx virtqueues and applies to both stream and datagram +virtqueues. + \subsection{Feature bits}\label{sec:Device Types / Socket Device / Feature bits} -There are currently no feature bits defined for this device. +\begin{description} +\item[VIRTIO_VSOCK_F_STREAM (0)] Device has support for stream socket type. 
+\end{description} + +\begin{description} +\item[VIRTIO_VSOCK_F_DGRAM (2)] Device has support for datagram socket type. +\end{description} + +If no feature bits are defined, assume device only supports stream socket type. \subsection{Device configuration layout}\label{sec:Device Types / Socket Device / Device configuration layout} @@ -107,6 +130,9 @@ \subsection{Device Operation}\label{sec:Device Types / Socket Device / Device Op \subsubsection{Virtqueue Flow Control}\label{sec:Device Types / Socket Device / Device Operation / Virtqueue Flow Control} +Flow control applies to stream sockets; datagram sockets do not have +flow control. + The tx virtqueue carries packets initiated by applications and replies to received packets. The rx virtqueue carries packets initiated by the device and replies to previously transmitted packets. @@ -140,12 +166,15 @@ \subsubsection{Addressing}\label{sec:Device Types / Socket Device / Device Opera consists of a (cid, port number) tuple. The header fields used for this are \field{src_cid}, \field{src_port}, \field{dst_cid}, and \field{dst_port}. -Currently only stream sockets are supported. \field{type} is 1 for stream -socket types. +Currently stream and datagram (dgram) sockets are supported. \field{type} is 1 for stream +socket types. \field{type} is 3 for dgram socket types. Stream sockets provide in-order, guaranteed, connection-oriented delivery without message boundaries. +Datagram sockets provide unordered, unreliable, connectionless messages +with message boundaries and a maximum length. + \subsubsection{Buffer Space Management}\label{sec:Device Types / Socket Device / Device Operation / Buffer Space Management} \field{buf_alloc} and \field{fwd_cnt} are used for buffer space management of stream sockets. 
The guest and the device publish how much buffer space is @@ -162,7 +191,7 @@ \subsubsection{Buffer Space Management}\label{sec:Device Types / Socket Device / u32 peer_free = peer_buf_alloc - (tx_cnt - peer_fwd_cnt); \end{lstlisting} -If there is insufficient buffer space, the sender waits until virtqueue buffers +For stream sockets, if there is insufficient buffer space, the sender waits until virtqueue buffers are returned and checks \field{buf_alloc} and \field{fwd_cnt} again. Sending the VIRTIO_VSOCK_OP_CREDIT_REQUEST packet queries how much buffer space is available. The reply to this query is a VIRTIO_VSOCK_OP_CREDIT_UPDATE packet. @@ -170,22 +199,33 @@ \subsubsection{Buffer Space Management}\label{sec:Device Types / Socket Device / previously receiving a VIRTIO_VSOCK_OP_CREDIT_REQUEST packet. This allows communicating updates any time a change in buffer space occurs. +Unlike stream sockets, dgram sockets do not use VIRTIO_VSOCK_OP_CREDIT_UPDATE or +VIRTIO_VSOCK_OP_CREDIT_REQUEST packets. The dgram buffer management +is split into two parts: sender side and receiver side. For the sender side, if the +v
Re: [PATCH V3 1/3] gpio: Add virtio-gpio driver
Hi, Not being very familiar with GPIO, I just have a few general comments and one on the config space layout On Thu, Jun 10, 2021 at 12:16:46PM +, Viresh Kumar via Stratos-dev wrote: > +static int virtio_gpio_req(struct virtio_gpio *vgpio, u16 type, u16 gpio, > +u8 txdata, u8 *rxdata) > +{ > + struct virtio_gpio_response *res = &vgpio->cres; > + struct virtio_gpio_request *req = &vgpio->creq; > + struct scatterlist *sgs[2], req_sg, res_sg; > + struct device *dev = &vgpio->vdev->dev; > + unsigned long time_left; > + unsigned int len; > + int ret; > + > + req->type = cpu_to_le16(type); > + req->gpio = cpu_to_le16(gpio); > + req->data = txdata; > + > + sg_init_one(&req_sg, req, sizeof(*req)); > + sg_init_one(&res_sg, res, sizeof(*res)); > + sgs[0] = &req_sg; > + sgs[1] = &res_sg; > + > + mutex_lock(&vgpio->lock); > + ret = virtqueue_add_sgs(vgpio->command_vq, sgs, 1, 1, res, GFP_KERNEL); > + if (ret) { > + dev_err(dev, "failed to add request to vq\n"); > + goto out; > + } > + > + reinit_completion(&vgpio->completion); > + virtqueue_kick(vgpio->command_vq); > + > + time_left = wait_for_completion_timeout(&vgpio->completion, HZ / 10); > + if (!time_left) { > + dev_err(dev, "virtio GPIO backend timeout\n"); > + return -ETIMEDOUT; mutex is still held > + } > + > + WARN_ON(res != virtqueue_get_buf(vgpio->command_vq, &len)); > + if (unlikely(res->status != VIRTIO_GPIO_STATUS_OK)) { > + dev_err(dev, "virtio GPIO request failed: %d\n", gpio); > + return -EINVAL; and here > + } > + > + if (rxdata) > + *rxdata = res->data; > + > +out: > + mutex_unlock(&vgpio->lock); > + > + return ret; > +} > + > +static int virtio_gpio_request(struct gpio_chip *gc, unsigned int gpio) > +{ > + struct virtio_gpio *vgpio = gpio_chip_to_vgpio(gc); > + > + return virtio_gpio_req(vgpio, VIRTIO_GPIO_REQ_ACTIVATE, gpio, 0, NULL); > +} > + > +static void virtio_gpio_free(struct gpio_chip *gc, unsigned int gpio) > +{ > + struct virtio_gpio *vgpio = gpio_chip_to_vgpio(gc); > + > + virtio_gpio_req(vgpio, 
VIRTIO_GPIO_REQ_DEACTIVATE, gpio, 0, NULL); > +} > + > +static int virtio_gpio_get_direction(struct gpio_chip *gc, unsigned int gpio) > +{ > + struct virtio_gpio *vgpio = gpio_chip_to_vgpio(gc); > + u8 direction; > + int ret; > + > + ret = virtio_gpio_req(vgpio, VIRTIO_GPIO_REQ_GET_DIRECTION, gpio, 0, > + &direction); > + if (ret) > + return ret; > + > + return direction; > +} > + > +static int virtio_gpio_direction_input(struct gpio_chip *gc, unsigned int > gpio) > +{ > + struct virtio_gpio *vgpio = gpio_chip_to_vgpio(gc); > + > + return virtio_gpio_req(vgpio, VIRTIO_GPIO_REQ_DIRECTION_IN, gpio, 0, > +NULL); > +} > + > +static int virtio_gpio_direction_output(struct gpio_chip *gc, unsigned int > gpio, > + int value) > +{ > + struct virtio_gpio *vgpio = gpio_chip_to_vgpio(gc); > + > + return virtio_gpio_req(vgpio, VIRTIO_GPIO_REQ_DIRECTION_OUT, gpio, (u8) (that dangling cast looks a bit odd to me) > +value, NULL); > +} > + > +static int virtio_gpio_get(struct gpio_chip *gc, unsigned int gpio) > +{ > + struct virtio_gpio *vgpio = gpio_chip_to_vgpio(gc); > + u8 value; > + int ret; > + > + ret = virtio_gpio_req(vgpio, VIRTIO_GPIO_REQ_GET_VALUE, gpio, 0, > &value); > + if (ret) > + return ret; > + > + return value; > +} > + > +static void virtio_gpio_set(struct gpio_chip *gc, unsigned int gpio, int > value) > +{ > + struct virtio_gpio *vgpio = gpio_chip_to_vgpio(gc); > + > + virtio_gpio_req(vgpio, VIRTIO_GPIO_REQ_SET_VALUE, gpio, value, NULL); > +} > + > +static void virtio_gpio_command(struct virtqueue *vq) > +{ > + struct virtio_gpio *vgpio = vq->vdev->priv; > + > + complete(&vgpio->completion); > +} > + > +static int virtio_gpio_alloc_vqs(struct virtio_device *vdev) > +{ > + struct virtio_gpio *vgpio = vdev->priv; > + const char * const names[] = { "command" }; > + vq_callback_t *cbs[] = { > + virtio_gpio_command, > + }; > + struct virtqueue *vqs[1] = {NULL}; > + int ret; > + > + ret = virtio_find_vqs(vdev, 1, vqs, cbs, names, NULL); > + if (ret) { > + 
dev_err(&vdev->dev, "failed to allocate vqs: %d\n", ret); > + return ret; > + } > + > + vgpio->command_vq = vqs[0]; > + > + /* Mark the device ready to perform operations from within probe() */ > + virtio_device_ready(vgpio->vdev); May fit better in the parent function > + return 0; > +} > + > +static void virtio_gpio_free_vqs(struct virtio_device *vdev) > +{ > + vdev->config->reset(vdev); > + vdev->config->del_vqs(vdev); > +} > + > +
Re: [Stratos-dev] [PATCH V3 1/3] gpio: Add virtio-gpio driver
On Thu, Jun 10, 2021 at 04:00:39PM +, Enrico Weigelt, metux IT consult via Stratos-dev wrote: > On 10.06.21 15:22, Arnd Bergmann wrote: > > > Can you give an example of how this would be hooked up to other drivers > > using those gpios. Can you give an example of how using the "gpio-keys" or > > "gpio-leds" drivers in combination with virtio-gpio looks like in the DT? > > Connecting between self-probing bus'es and DT is generally tricky. IMHO > we don't have any generic mechanism for that. DT does have a generic description of PCI endpoints, which virtio-iommu relies on to express the relation between IOMMU and endpoint nodes [1]. I think the problem here is similar: the client node needs a phandle to the GPIO controller which may use virtio-pci transport? Note that it mostly works if the device is on the root PCI bus. Behind a bridge the OS may change the device's bus number as needed, so the BDF reference in DT is only valid if the software providing the DT description (VMM or firmware) initializes bus numbers accordingly (and I don't remember if Linux supports this case well). Thanks, Jean [1] Documentation/devicetree/bindings/virtio/iommu.txt > > I've made a few attempts, but nothing practically useful, which would be > accepted by the corresponding maintainers, yet. We'd either need some > very special logic in DT probing or pseudo-bus'es for the mapping. > (DT wants to do those connections via phandle's, which in turn need the > referenced nodes to be present in the DT). > > > From what I can tell, both the mmio and pci variants of virtio can have > > their > > dev->of_node populated, but I don't see the logic in > > register_virtio_device() > > that looks up the of_node of the virtio_device that the of_gpio code then > > tries to refer to. > > Have you ever successfully bound a virtio device via DT ? > > > --mtx > > -- > --- > Note: unencrypted e-mails can easily be intercepted and manipulated! 
For confidential communication, please send your > GPG/PGP key. > --- > Enrico Weigelt, metux IT consult > Free software and Linux embedded engineering > i...@metux.net -- +49-151-27565287 > -- > Stratos-dev mailing list > stratos-...@op-lists.linaro.org > https://op-lists.linaro.org/mailman/listinfo/stratos-dev ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH V3 1/3] gpio: Add virtio-gpio driver
Hi Enrico, On Thu, 10 Jun 2021 at 21:24, Enrico Weigelt, metux IT consult wrote: > On 10.06.21 14:16, Viresh Kumar wrote: > > From: "Enrico Weigelt, metux IT consult" > > > > This patch adds a new driver for Virtio based GPIO devices. > > > > This allows a guest VM running Linux to access GPIO device provided by > > the host. It supports all basic operations for now, except interrupts > > for the GPIO lines. > > What exactly did you change here from my original driver ? I didn't write it from scratch and used your patch only to start with, and so you are still the author of this particular patch. This just followed the specification updates and so changes only the stuff that was different from your original specs. Apart from that, as you know, the irqs weren't working in your case as they needed to be implemented differently (patch 2 does that) here. > You already changed the spec heavily (w/o consulting me first), Isn't that what we are doing on the list? I believe that's why the lists exist, so people don't need to discuss in person, offline. I am happy to make changes in the spec if you want to suggest something and if something breaks it for you. > so I guess this driver hasn't so much in common w/ my original design. It actually has, as I said earlier; it is still based on your patch. > Note that I made my original design decisions for good reasons > (see virtio-dev list). I tried to follow your patches from December on the Linux kernel list and the ID allocation one on the virtio-comments list. I wasn't able to search for any other attempt at sending the specification, so may have missed something. I do understand that there were reasons why you (and I) chose a particular way of doing things and if there is a good reason for following that, then we can still modify the spec and fix things for everyone. 
We just need to discuss our reasoning on the list and get through with a specification which works for everyone, as this will become a standard interface for many in the future and it needs to be robust and efficient for everyone. > It already supports async notifications > (IOW: irqs), had an easy upgrade path for future additions, etc. I haven't changed the irq path, we still have a separate virtqueue (named "interrupt" vq) which handles just the interrupts now. And so it will be faster than what you initially suggested. In your original design you also received the responses for the requests on this virtqueue, which I have changed to get the response synchronously on the "command" virtqueue only. This is from one of your earlier replies: " I've been under the impression that queues only work in only one direction. (at least that's what my web research was telling). Could you please give an example how bi-directional transmission within the same queue could look like ? " It is actually possible and the right thing to do here as we aren't starting multiple requests together. The operation needs to be synchronous anyway and both request/response on the same channel work better there. Also that makes the interrupts reach faster, without additional delay due to responses. > Note #2: it's already implemented and running in the field (different > kernels, different hypervisors, ...) - it just lacked going through the > virtio committee's formal specification process, which blocked mainlining. I understand the pain you would be required to go through because of this, but this is how any open source community will work, like Linux. There will be changes in specification and code once you post it, and any solutions already working in the field won't really matter during the discussion. That is why it is always the right thing to _upstream first_, so one can avoid these problems later on. Though I understand that the real world needs to move faster than the community. 
But no one can really help with that. > Is there anything fundamentally wrong w/ my original design, why you > invented a completely new one ? Not much, and I haven't changed a lot either. Let me point out a few things which have changed in the specification since your earlier version (the code just followed the specification, that's it). - config structure - version information is replaced with virtio-features. - Irq support is added as a feature, as you suggested. - extra padding space (24 bytes) removed. If you see this patch, we can now read the entire config structure in a single go. Why should anyone be required to copy an extra 24 bytes? If we need more fields later, we can always do that with the help of another feature flag. So this is still extendable. - Virtqueues: we still have two virtqueues (command and interrupt), just responses are moved to the command virtqueue only, as that is more efficient. - Request/response: Request layout is the same; added a new layout for response as sharing the same layout is inefficient. - IRQ support: Initial specification had no interface for configuring irqs from the guest, I added that. That's all I
Re: Re: [RFC v1 0/6] virtio/vsock: introduce SOCK_DGRAM support
On Thu, Jun 10, 2021 at 2:52 AM Stefano Garzarella wrote: > > On Thu, Jun 10, 2021 at 03:46:55PM +0800, Jason Wang wrote: > > > >On 2021/6/10 3:23 PM, Stefano Garzarella wrote: > >>On Thu, Jun 10, 2021 at 12:02:35PM +0800, Jason Wang wrote: > >>> > >>>On 2021/6/10 11:43 AM, Jiang Wang . wrote: > On Wed, Jun 9, 2021 at 6:51 PM Jason Wang wrote: > > > >On 2021/6/10 7:24 AM, Jiang Wang wrote: > >>This patchset implements support of SOCK_DGRAM for virtio > >>transport. > >> > >>Datagram sockets are connectionless and unreliable. To avoid > >>unfair contention > >>with stream and other sockets, add two more virtqueues and > >>a new feature bit to indicate if those two new queues exist or not. > >> > >>Dgram does not use the existing credit update mechanism for > >>stream sockets. When sending from the guest/driver, packets are sent > >>synchronously, so the sender will get an error when the > >>virtqueue is full. > >>When sending from the host/device, packets are sent asynchronously > >>because the descriptor memory belongs to the corresponding QEMU > >>process. > > > >What's the use case for the datagram vsock? > > > One use case is for non-critical info logging from the guest > to the host, such as the performance data of some applications. > >>> > >>> > >>>Anything that prevents you from using the stream socket? > >>> > >>> > > It can also be used to replace UDP communications between > the guest and the host. > >>> > >>> > >>>Any advantage for VSOCK in this case? Is it for performance (I > >>>guess not since I don't expect vsock will be faster). > >> > >>I think the general advantage to using vsock is for the guest > >>agents that potentially don't need any configuration. > > > > > >Right, I wonder if we really need datagram considering the host to guest > >communication is reliable. 
> > > >(Note that I don't object it since vsock has already supported that, > >just wonder its use cases) > > Yep, it was the same concern I had :-) > Also because we're now adding SEQPACKET, which provides reliable > datagram support. > > But IIUC the use case is the logging where you don't need reliable > communication and you want to avoid keeping more open connections with > different guests. > > So the server in the host can be pretty simple and doesn't have to > handle connections. It just waits for datagrams on a port. Yes. With datagram sockets, the application code is simpler than with stream sockets. Also, it will be easier to port existing applications written for dgram, such as UDP or UNIX domain sockets with datagram types, to the vsock dgram sockets. Compared to UDP, the vsock dgram needs minimal configuration. When sending data from the guest to the host, the client in the guest knows the host CID will always be 2. For UDP, the host IP may change depending on the configuration. The advantage over UNIX domain sockets is more obvious. We have some applications talking to each other with UNIX domain sockets, but now the applications are running inside VMs, so we will need to use vsock, and those applications use datagram types, so it is natural and simpler if vsock has datagram types too. And we can also run applications written for VMware vsock dgram on QEMU directly. btw, SEQPACKET also supports datagrams, but the application code logic is similar to stream sockets and the server needs to maintain connections. > > > > > >> > >>> > >>>An obvious drawback is that it breaks the migration. Using UDP you > >>>can have very rich feature support from the kernel where vsock > >>>can't. > >>> > >> > >>Thanks for bringing this up! > >>What features does UDP support that datagram on vsock could not? > > > > > >E.g. the sendpage() and busy polling. And using UDP means qdiscs and > >eBPF can work. > > Thanks, I see! 
> > > > > > >> > >>> > > >>The virtio spec patch is here: > >>https://www.spinics.net/lists/linux-virtualization/msg50027.html > > > >Have a quick glance, I suggest to split mergeable rx buffer into an > >separate patch. > Sure. > > >But I think it's time to revisit the idea of unifying the > >virtio-net and > >virtio-vsock. Otherwise we're duplicating features and bugs. > For mergeable rxbuf related code, I think a set of common helper > functions can be used by both virtio-net and virtio-vsock. For other > parts, that may not be very beneficial. I will think about more. > > If there is a previous email discussion about this topic, could > you send me > some links? I did a quick web search but did not find any related > info. Thanks. > >>> > >>> > >>>We had a lot: > >>> > >>>[1] > >>>https://patchwork.kernel.org/project/kvm/patch/5bdff537.3050...@huawei.com/ > >>>[2] > >>>https://lists.linuxfoundation.org/pipermail/virtualization/2018-November/039798.html > >>>[3] https://www.lkml.org/lkml/2020/1/16/2043 > >>> Got it. I will check, thanks. >
Re: [PATCH 06/17] mshv: SynIC port and connection hypercalls
Vineeth Pillai writes: > Hyper-V enables inter-partition communication through the port and > connection constructs. More details about ports and connections are in > TLFS chapter 11. > > Implement hypercalls related to ports and connections for enabling > inter-partition communication. > > Signed-off-by: Vineeth Pillai > --- > drivers/hv/hv_call.c | 161 + > drivers/hv/mshv.h | 12 ++ > include/asm-generic/hyperv-tlfs.h | 55 + > include/linux/hyperv.h | 9 -- > include/uapi/asm-generic/hyperv-tlfs.h | 76 > 5 files changed, 304 insertions(+), 9 deletions(-) > > diff --git a/drivers/hv/hv_call.c b/drivers/hv/hv_call.c > index 025d4e2b892f..57db3a8ac94a 100644 > --- a/drivers/hv/hv_call.c > +++ b/drivers/hv/hv_call.c > @@ -742,3 +742,164 @@ int hv_call_translate_virtual_address( > return hv_status_to_errno(status); > } > > + > +int > +hv_call_create_port(u64 port_partition_id, union hv_port_id port_id, > + u64 connection_partition_id, > + struct hv_port_info *port_info, > + u8 port_vtl, u8 min_connection_vtl, int node) > +{ > + struct hv_create_port *input; > + unsigned long flags; > + int ret = 0; > + int status; > + > + do { > + local_irq_save(flags); > + input = (struct hv_create_port *)(*this_cpu_ptr( > + hyperv_pcpu_input_arg)); > + memset(input, 0, sizeof(*input)); > + > + input->port_partition_id = port_partition_id; > + input->port_id = port_id; > + input->connection_partition_id = connection_partition_id; > + input->port_info = *port_info; > + input->port_vtl = port_vtl; > + input->min_connection_vtl = min_connection_vtl; > + input->proximity_domain_info = > + numa_node_to_proximity_domain_info(node); > + status = hv_do_hypercall(HVCALL_CREATE_PORT, input, > + NULL) & HV_HYPERCALL_RESULT_MASK; > + local_irq_restore(flags); > + if (status == HV_STATUS_SUCCESS) > + break; > + > + if (status != HV_STATUS_INSUFFICIENT_MEMORY) { > + pr_err("%s: %s\n", > +__func__, hv_status_to_string(status)); > + ret = -hv_status_to_errno(status); In Nuno's "x86/hyperv: convert hyperv 
statuses to linux error codes" patch, hv_status_to_errno() already returns negatives: +int hv_status_to_errno(u64 hv_status) +{ + switch (hv_result(hv_status)) { + case HV_STATUS_SUCCESS: + return 0; + case HV_STATUS_INVALID_PARAMETER: + case HV_STATUS_UNKNOWN_PROPERTY: + case HV_STATUS_PROPERTY_VALUE_OUT_OF_RANGE: + case HV_STATUS_INVALID_VP_INDEX: + case HV_STATUS_INVALID_REGISTER_VALUE: + case HV_STATUS_INVALID_LP_INDEX: + return -EINVAL; + case HV_STATUS_ACCESS_DENIED: + case HV_STATUS_OPERATION_DENIED: + return -EACCES; + case HV_STATUS_NOT_ACKNOWLEDGED: + case HV_STATUS_INVALID_VP_STATE: + case HV_STATUS_INVALID_PARTITION_STATE: + return -EBADFD; + } + return -ENOTRECOVERABLE; +} +EXPORT_SYMBOL_GPL(hv_status_to_errno); + > + break; > + } > + ret = hv_call_deposit_pages(NUMA_NO_NODE, > + port_partition_id, 1); > + > + } while (!ret); > + > + return ret; > +} > + > +int > +hv_call_delete_port(u64 port_partition_id, union hv_port_id port_id) > +{ > + union hv_delete_port input = { 0 }; > + unsigned long flags; > + int status; > + > + local_irq_save(flags); > + input.port_partition_id = port_partition_id; > + input.port_id = port_id; > + status = hv_do_fast_hypercall16(HVCALL_DELETE_PORT, > + input.as_uint64[0], > + input.as_uint64[1]) & > + HV_HYPERCALL_RESULT_MASK; > + local_irq_restore(flags); > + > + if (status != HV_STATUS_SUCCESS) { > + pr_err("%s: %s\n", __func__, hv_status_to_string(status)); > + return -hv_status_to_errno(status); > + } > + > + return 0; > +} > + > +int > +hv_call_connect_port(u64 port_partition_id, union hv_port_id port_id, > + u64 connection_partition_id, > + union hv_connection_id connection_id, > + struct hv_connection_info *connection_info, > + u8 connection_vtl, int node) > +{ > + struct hv_connect_port *input; > + unsigned long flags; > + int ret = 0, status; > + > + do { > + local_irq_save(flags); > + input = (struct hv_connect_port *)(*this_cpu_ptr( > + hyperv_pcpu_input
[PATCH V3 2/3] gpio: virtio: Add IRQ support
This patch adds IRQ support for the virtio GPIO driver. Note that this uses the irq_bus_lock/unlock() callbacks since the operations over virtio can sleep. Signed-off-by: Viresh Kumar --- drivers/gpio/gpio-virtio.c | 256 ++- include/uapi/linux/virtio_gpio.h | 15 ++ 2 files changed, 263 insertions(+), 8 deletions(-) diff --git a/drivers/gpio/gpio-virtio.c b/drivers/gpio/gpio-virtio.c index 1ba8ddf4693a..8bc4b1876961 100644 --- a/drivers/gpio/gpio-virtio.c +++ b/drivers/gpio/gpio-virtio.c @@ -20,6 +20,13 @@ #include #include +struct vgpio_line { + u8 irq_type; + bool irq_type_pending; + bool masked; + bool masked_pending; +}; + struct virtio_gpio { struct virtio_device *vdev; struct mutex lock; @@ -30,14 +37,20 @@ struct virtio_gpio { struct virtio_gpio_request creq; struct virtio_gpio_response cres; + struct irq_chip *ic; + struct vgpio_line *lines; + struct virtqueue *interrupt_vq; + struct virtio_gpio_request ireq; + /* This must be kept as the last entry here, hint: gpio_names[0] */ struct virtio_gpio_config config; }; #define gpio_chip_to_vgpio(chip) container_of(chip, struct virtio_gpio, gc) +#define irq_data_to_gpio_chip(d) (d->domain->host_data) -static int virtio_gpio_req(struct virtio_gpio *vgpio, u16 type, u16 gpio, - u8 txdata, u8 *rxdata) +static int virtio_gpio_req_unlocked(struct virtio_gpio *vgpio, u16 type, + u16 gpio, u8 txdata, u8 *rxdata) { struct virtio_gpio_response *res = &vgpio->cres; struct virtio_gpio_request *req = &vgpio->creq; @@ -56,11 +69,10 @@ static int virtio_gpio_req(struct virtio_gpio *vgpio, u16 type, u16 gpio, sgs[0] = &req_sg; sgs[1] = &res_sg; - mutex_lock(&vgpio->lock); ret = virtqueue_add_sgs(vgpio->command_vq, sgs, 1, 1, res, GFP_KERNEL); if (ret) { dev_err(dev, "failed to add request to vq\n"); - goto out; + return ret; } reinit_completion(&vgpio->completion); @@ -81,7 +93,16 @@ static int virtio_gpio_req(struct virtio_gpio *vgpio, u16 type, u16 gpio, if (rxdata) *rxdata = res->data; -out: + return ret; +} + +static int 
virtio_gpio_req(struct virtio_gpio *vgpio, u16 type, u16 gpio, + u8 txdata, u8 *rxdata) +{ + int ret; + + mutex_lock(&vgpio->lock); + ret = virtio_gpio_req_unlocked(vgpio, type, gpio, txdata, rxdata); mutex_unlock(&vgpio->lock); return ret; @@ -152,6 +173,183 @@ static void virtio_gpio_set(struct gpio_chip *gc, unsigned int gpio, int value) virtio_gpio_req(vgpio, VIRTIO_GPIO_REQ_SET_VALUE, gpio, value, NULL); } +/* Interrupt handling */ +static void vgpio_irq_mask(struct virtio_gpio *vgpio, u16 gpio) +{ + int ret; + + ret = virtio_gpio_req_unlocked(vgpio, VIRTIO_GPIO_REQ_IRQ_MASK, gpio, 0, + NULL); + if (ret) + dev_err(&vgpio->vdev->dev, "failed to mask irq: %d\n", ret); +} + +static void vgpio_irq_unmask(struct virtio_gpio *vgpio, u16 gpio) +{ + int ret; + + ret = virtio_gpio_req_unlocked(vgpio, VIRTIO_GPIO_REQ_IRQ_UNMASK, gpio, + 0, NULL); + if (ret) + dev_err(&vgpio->vdev->dev, "failed to unmask irq: %d\n", ret); +} + +static void vgpio_irq_set_type(struct virtio_gpio *vgpio, u16 gpio, u8 type) +{ + int ret; + + ret = virtio_gpio_req_unlocked(vgpio, VIRTIO_GPIO_REQ_IRQ_TYPE, gpio, + type, NULL); + if (ret) + dev_err(&vgpio->vdev->dev, "failed to set irq type: %d\n", ret); +} + +static void virtio_gpio_irq_mask(struct irq_data *d) +{ + struct gpio_chip *gc = irq_data_to_gpio_chip(d); + struct virtio_gpio *vgpio = gpio_chip_to_vgpio(gc); + struct vgpio_line *line = &vgpio->lines[d->hwirq]; + + line->masked = true; + line->masked_pending = true; +} + +static void virtio_gpio_irq_unmask(struct irq_data *d) +{ + struct gpio_chip *gc = irq_data_to_gpio_chip(d); + struct virtio_gpio *vgpio = gpio_chip_to_vgpio(gc); + struct vgpio_line *line = &vgpio->lines[d->hwirq]; + + line->masked = false; + line->masked_pending = true; +} + +static int virtio_gpio_irq_set_type(struct irq_data *d, unsigned int type) +{ + struct gpio_chip *gc = irq_data_to_gpio_chip(d); + struct virtio_gpio *vgpio = gpio_chip_to_vgpio(gc); + struct vgpio_line *line = &vgpio->lines[d->hwirq]; + u8 
irq_type; + + switch (type) { + case IRQ_TYPE_NONE: + irq_type = VIRTIO_GPIO_IRQ_TYPE_NONE; + break; + case IRQ_TYPE_EDGE_RISING: + irq_type = VIRTIO_GPIO_IRQ_TYPE_EDGE_RISING; + break; + case IRQ_TY
[PATCH V3 1/3] gpio: Add virtio-gpio driver
From: "Enrico Weigelt, metux IT consult" This patch adds a new driver for Virtio based GPIO devices. This allows a guest VM running Linux to access GPIO device provided by the host. It supports all basic operations for now, except interrupts for the GPIO lines. Signed-off-by: "Enrico Weigelt, metux IT consult" Co-developed-by: Viresh Kumar Signed-off-by: Viresh Kumar --- drivers/gpio/Kconfig | 9 + drivers/gpio/Makefile| 1 + drivers/gpio/gpio-virtio.c | 326 +++ include/uapi/linux/virtio_gpio.h | 41 include/uapi/linux/virtio_ids.h | 1 + 5 files changed, 378 insertions(+) create mode 100644 drivers/gpio/gpio-virtio.c create mode 100644 include/uapi/linux/virtio_gpio.h diff --git a/drivers/gpio/Kconfig b/drivers/gpio/Kconfig index e3607ec4c2e8..7f12fb65d130 100644 --- a/drivers/gpio/Kconfig +++ b/drivers/gpio/Kconfig @@ -1633,6 +1633,15 @@ config GPIO_MOCKUP tools/testing/selftests/gpio/gpio-mockup.sh. Reference the usage in it. +config GPIO_VIRTIO + tristate "VirtIO GPIO support" + depends on VIRTIO + help + Say Y here to enable guest support for virtio-based GPIO controllers. + + These virtual GPIOs can be routed to real GPIOs or attached to + simulators on the host (like QEMU). 
+ endmenu endif diff --git a/drivers/gpio/Makefile b/drivers/gpio/Makefile index c58a90a3c3b1..ace004c80871 100644 --- a/drivers/gpio/Makefile +++ b/drivers/gpio/Makefile @@ -162,6 +162,7 @@ obj-$(CONFIG_GPIO_UCB1400) += gpio-ucb1400.o obj-$(CONFIG_GPIO_UNIPHIER)+= gpio-uniphier.o obj-$(CONFIG_GPIO_VF610) += gpio-vf610.o obj-$(CONFIG_GPIO_VIPERBOARD) += gpio-viperboard.o +obj-$(CONFIG_GPIO_VIRTIO) += gpio-virtio.o obj-$(CONFIG_GPIO_VISCONTI)+= gpio-visconti.o obj-$(CONFIG_GPIO_VR41XX) += gpio-vr41xx.o obj-$(CONFIG_GPIO_VX855) += gpio-vx855.o diff --git a/drivers/gpio/gpio-virtio.c b/drivers/gpio/gpio-virtio.c new file mode 100644 index ..1ba8ddf4693a --- /dev/null +++ b/drivers/gpio/gpio-virtio.c @@ -0,0 +1,326 @@ +// SPDX-License-Identifier: GPL-2.0+ +/* + * GPIO driver for virtio-based virtual GPIO controllers + * + * Copyright (C) 2021 metux IT consult + * Enrico Weigelt, metux IT consult + * + * Copyright (C) 2021 Linaro. + * Viresh Kumar + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +struct virtio_gpio { + struct virtio_device *vdev; + struct mutex lock; + struct gpio_chip gc; + + struct completion completion; + struct virtqueue *command_vq; + struct virtio_gpio_request creq; + struct virtio_gpio_response cres; + + /* This must be kept as the last entry here, hint: gpio_names[0] */ + struct virtio_gpio_config config; +}; + +#define gpio_chip_to_vgpio(chip) container_of(chip, struct virtio_gpio, gc) + +static int virtio_gpio_req(struct virtio_gpio *vgpio, u16 type, u16 gpio, + u8 txdata, u8 *rxdata) +{ + struct virtio_gpio_response *res = &vgpio->cres; + struct virtio_gpio_request *req = &vgpio->creq; + struct scatterlist *sgs[2], req_sg, res_sg; + struct device *dev = &vgpio->vdev->dev; + unsigned long time_left; + unsigned int len; + int ret; + + req->type = cpu_to_le16(type); + req->gpio = cpu_to_le16(gpio); + req->data = txdata; + + sg_init_one(&req_sg, req, sizeof(*req)); + sg_init_one(&res_sg, 
res, sizeof(*res)); + sgs[0] = &req_sg; + sgs[1] = &res_sg; + + mutex_lock(&vgpio->lock); + ret = virtqueue_add_sgs(vgpio->command_vq, sgs, 1, 1, res, GFP_KERNEL); + if (ret) { + dev_err(dev, "failed to add request to vq\n"); + goto out; + } + + reinit_completion(&vgpio->completion); + virtqueue_kick(vgpio->command_vq); + + time_left = wait_for_completion_timeout(&vgpio->completion, HZ / 10); + if (!time_left) { + dev_err(dev, "virtio GPIO backend timeout\n"); + return -ETIMEDOUT; + } + + WARN_ON(res != virtqueue_get_buf(vgpio->command_vq, &len)); + if (unlikely(res->status != VIRTIO_GPIO_STATUS_OK)) { + dev_err(dev, "virtio GPIO request failed: %d\n", gpio); + return -EINVAL; + } + + if (rxdata) + *rxdata = res->data; + +out: + mutex_unlock(&vgpio->lock); + + return ret; +} + +static int virtio_gpio_request(struct gpio_chip *gc, unsigned int gpio) +{ + struct virtio_gpio *vgpio = gpio_chip_to_vgpio(gc); + + return virtio_gpio_req(vgpio, VIRTIO_GPIO_REQ_ACTIVATE, gpio, 0, NULL); +} + +static void virtio_gpio_free(struct gpio_chip *gc, unsigned int gpio) +{ + struct virtio_gpio *vgpio = gpio_chip_to_vgpio(gc); + + virtio_gpio_req(
[PATCH V3 0/3] gpio: Add virtio based driver
Hello, This adds a virtio-based GPIO driver based on the proposed specification [1]. The first two versions [2] were sent by Enrico earlier and here is a newer version. I have retained the authorship of the work done by Enrico (1st patch) to make sure we don't lose his credits. I have tested all basic operations of the patchset (with a Qemu guest) with the libgpiod utility. I have also tested the handling of interrupts on the guest side. All works as expected. I will now be working towards a Rust-based hypervisor-agnostic backend to provide a generic solution for that. This should be merged only after the specifications are merged; I will keep everyone posted on the same. I am not adding a version history here as a lot of changes have been made, from protocol to code, and this really needs a full review from scratch. -- Viresh [1] https://lists.oasis-open.org/archives/virtio-comment/202106/msg00022.html [2] https://lore.kernel.org/linux-gpio/20201203191135.21576-2-i...@metux.net/ Enrico Weigelt, metux IT consult (1): gpio: Add virtio-gpio driver Viresh Kumar (2): gpio: virtio: Add IRQ support MAINTAINERS: Add entry for Virtio-gpio MAINTAINERS | 7 + drivers/gpio/Kconfig | 9 + drivers/gpio/Makefile| 1 + drivers/gpio/gpio-virtio.c | 566 +++ include/uapi/linux/virtio_gpio.h | 56 +++ include/uapi/linux/virtio_ids.h | 1 + 6 files changed, 640 insertions(+) create mode 100644 drivers/gpio/gpio-virtio.c create mode 100644 include/uapi/linux/virtio_gpio.h -- 2.31.1.272.g89b43f80a514
Re: [PATCH v4 3/6] x86/sev-es: Split up runtime #VC handler for correct state tracking
Hi Peter, On Thu, Jun 10, 2021 at 12:19:43PM +0200, Peter Zijlstra wrote: > On Thu, Jun 10, 2021 at 11:11:38AM +0200, Joerg Roedel wrote: > > > +static void vc_handle_from_kernel(struct pt_regs *regs, unsigned long > > error_code) > > static noinstr ... Right, I forgot that, will update the patch and add the correct noinstr annotations. > > + if (user_mode(regs)) > > + vc_handle_from_user(regs, error_code); > > + else > > + vc_handle_from_kernel(regs, error_code); > > } > > #DB and MCE use idtentry_mce_db and split out in asm. When I look at > idtentry_vc, it appears to me that VC_SAFE_STACK already implies > from-user, or am I reading that wrong? VC_SAFE_STACK does not imply from-user. It means that the #VC handler asm code was able to switch away from the IST stack to either the task-stack (if from-user or syscall gap) or to the previous kernel stack. There is a check in vc_switch_off_ist() that shows which stacks are considered safe. If it can not switch to a safe stack the VC entry code switches to the fall-back stack and a special handler function is called, which for now just panics the system. > How about you don't do that and have exc_ call your new from_kernel > function, then we know that safe_stack_ is always from-user. Then also > maybe do: > > s/VS_SAFE_STACK/VC_USER/ > s/safe_stack_/noist_/ > > to match all the others (#DB/MCE). So #VC is different from #DB and #MCE in that it switches stacks even when coming from kernel mode, so that the #VC handler can be nested. What I can do is to call the from_user function directly from asm in the .Lfrom_user_mode_switch_stack path. That will avoid having another from_user check in C code. > DEFINE_IDTENTRY_VC(exc_vc) > { > if (unlikely(on_vc_fallback_stack(regs))) { > instrumentation_begin(); > panic("boohooo\n"); > instrumentation_end(); The on_vc_fallback_stack() path is for now only calling panic(), because it can't be hit when the hypervisor is behaving correctly. 
In the future it is not clear yet if that path needs to be extended for SNP page validation exceptions, which can basically happen anywhere. The implementation of SNP should make sure that all memory touched during entry (while on unsafe stacks) is always validated, but not sure yet if that holds when live-migration of SNP guests is added to the picture. There is the possibility that this doesn't fit in the above branch, but it can also be moved to a separate function if needed. > } > > vc_from_kernel(); > } > > DEFINE_IDTENTRY_VC_USER(exc_vc) > { > vc_from_user(); > } > > Which is, I'm thinking, much simpler, no? Okay, I am going to try this out. Thanks for your feedback. Regards, Joerg ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH v4 3/6] x86/sev-es: Split up runtime #VC handler for correct state tracking
Bah, I suppose the trouble is that this SEV crap requires PARAVIRT? I should really get around to fixing noinstr validation with PARAVIRT on :-( On Thu, Jun 10, 2021 at 11:11:38AM +0200, Joerg Roedel wrote: > +static void vc_handle_from_kernel(struct pt_regs *regs, unsigned long > error_code) static noinstr ... > +{ > + irqentry_state_t irq_state = irqentry_nmi_enter(regs); > > + instrumentation_begin(); > > + if (!vc_raw_handle_exception(regs, error_code)) { > /* Show some debug info */ > show_regs(regs); > > @@ -1434,7 +1400,59 @@ DEFINE_IDTENTRY_VC_SAFE_STACK(exc_vmm_communication) > panic("Returned from Terminate-Request to Hypervisor\n"); > } > > + instrumentation_end(); > + irqentry_nmi_exit(regs, irq_state); > +} > + > +static void vc_handle_from_user(struct pt_regs *regs, unsigned long > error_code) static noinstr ... > +{ > + irqentry_state_t irq_state = irqentry_enter(regs); > + > + instrumentation_begin(); > + > + if (!vc_raw_handle_exception(regs, error_code)) { > + /* > + * Do not kill the machine if user-space triggered the > + * exception. Send SIGBUS instead and let user-space deal with > + * it. > + */ > + force_sig_fault(SIGBUS, BUS_OBJERR, (void __user *)0); > + } > + > + instrumentation_end(); > + irqentry_exit(regs, irq_state); > +} + linebreak > +/* > + * Main #VC exception handler. It is called when the entry code was able to > + * switch off the IST to a safe kernel stack. > + * > + * With the current implementation it is always possible to switch to a safe > + * stack because #VC exceptions only happen at known places, like intercepted > + * instructions or accesses to MMIO areas/IO ports. They can also happen with > + * code instrumentation when the hypervisor intercepts #DB, but the critical > + * paths are forbidden to be instrumented, so #DB exceptions currently also > + * only happen in safe places. 
> + */ > +DEFINE_IDTENTRY_VC_SAFE_STACK(exc_vmm_communication) > +{ > + /* > + * Handle #DB before calling into !noinstr code to avoid recursive #DB. > + */ > + if (error_code == SVM_EXIT_EXCP_BASE + X86_TRAP_DB) { > + vc_handle_trap_db(regs); > + return; > + } > + > + /* > + * This is invoked through an interrupt gate, so IRQs are disabled. The > + * code below might walk page-tables for user or kernel addresses, so > + * keep the IRQs disabled to protect us against concurrent TLB flushes. > + */ > + > + if (user_mode(regs)) > + vc_handle_from_user(regs, error_code); > + else > + vc_handle_from_kernel(regs, error_code); > } #DB and MCE use idtentry_mce_db and split out in asm. When I look at idtentry_vc, it appears to me that VC_SAFE_STACK already implies from-user, or am I reading that wrong? Ah, it appears you're muddling things up again by then also calling safe_stack_ from exc_. How about you don't do that and have exc_ call your new from_kernel function, then we know that safe_stack_ is always from-user. Then also maybe do: s/VS_SAFE_STACK/VC_USER/ s/safe_stack_/noist_/ to match all the others (#DB/MCE). Also, AFAICT, you don't actually need DEFINE_IDTENTRY_VC_IST, it doesn't have an ASM counterpart. So then you end up with something like: DEFINE_IDTENTRY_VC(exc_vc) { if (unlikely(on_vc_fallback_stack(regs))) { instrumentation_begin(); panic("boohooo\n"); instrumentation_end(); } vc_from_kernel(); } DEFINE_IDTENTRY_VC_USER(exc_vc) { vc_from_user(); } Which is, I'm thinking, much simpler, no? ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [RFC v1 0/6] virtio/vsock: introduce SOCK_DGRAM support
On Thu, Jun 10, 2021 at 03:46:55PM +0800, Jason Wang wrote: On 2021/6/10 3:23 PM, Stefano Garzarella wrote: On Thu, Jun 10, 2021 at 12:02:35PM +0800, Jason Wang wrote: On 2021/6/10 11:43 AM, Jiang Wang wrote: On Wed, Jun 9, 2021 at 6:51 PM Jason Wang wrote: On 2021/6/10 7:24 AM, Jiang Wang wrote: This patchset implements support of SOCK_DGRAM for virtio transport. Datagram sockets are connectionless and unreliable. To avoid unfair contention with stream and other sockets, add two more virtqueues and a new feature bit to indicate if those two new queues exist or not. Dgram does not use the existing credit update mechanism for stream sockets. When sending from the guest/driver, packets are sent synchronously, so the sender will get an error when the virtqueue is full. When sending from the host/device, packets are sent asynchronously because the descriptor memory belongs to the corresponding QEMU process. What's the use case for the datagram vsock? One use case is for non-critical info logging from the guest to the host, such as the performance data of some applications. Anything that prevents you from using the stream socket? It can also be used to replace UDP communications between the guest and the host. Any advantage for VSOCK in this case? Is it for performance (I guess not since I don't expect vsock will be faster). I think the general advantage of using vsock is for the guest agents that potentially don't need any configuration. Right, I wonder if we really need datagram considering the host-to-guest communication is reliable. (Note that I don't object to it since vsock has already supported that, just wondering about its use cases) Yep, it was the same concern I had :-) Also because we're now adding SEQPACKET, which provides reliable datagram support. But IIUC the use case is logging, where you don't need reliable communication and you want to avoid keeping more open connections with different guests.
So the server in the host can be pretty simple and doesn't have to handle connections. It just waits for datagrams on a port. An obvious drawback is that it breaks the migration. Using UDP you can have very rich feature support from the kernel that vsock can't. Thanks for bringing this up! What features does UDP support that datagram on vsock could not? E.g. the sendpage() and busy polling. And using UDP means qdiscs and eBPF can work. Thanks, I see! The virtio spec patch is here: https://www.spinics.net/lists/linux-virtualization/msg50027.html Having had a quick glance, I suggest splitting the mergeable rx buffer into a separate patch. Sure. But I think it's time to revisit the idea of unifying virtio-net and virtio-vsock. Otherwise we're duplicating features and bugs. For mergeable-rxbuf-related code, I think a set of common helper functions can be used by both virtio-net and virtio-vsock. For other parts, that may not be very beneficial. I will think about it more. If there is a previous email discussion about this topic, could you send me some links? I did a quick web search but did not find any related info. Thanks. We had a lot: [1] https://patchwork.kernel.org/project/kvm/patch/5bdff537.3050...@huawei.com/ [2] https://lists.linuxfoundation.org/pipermail/virtualization/2018-November/039798.html [3] https://www.lkml.org/lkml/2020/1/16/2043 When I tried it, the biggest problems that blocked me were all the features strictly related to the TCP/IP stack and ethernet devices that the vsock device doesn't know how to handle: TSO, GSO, checksums, MAC, napi, xdp, min ethernet frame size, MTU, etc. It depends on which level we want to share: 1) sharing code 2) sharing devices 3) making vsock a protocol that is understood by the network core. We can start from 1); the low-level tx/rx logic can be shared between virtio-net and vhost-net. For 2) we probably need some work on the spec, probably with a new feature bit to indicate that it's a vsock device, not an ethernet device.
Then if it is probed as a vsock device we won't let packets be delivered to the TCP/IP stack. For 3), it would be even harder and I'm not sure it's worth doing. So in my opinion unifying them is not so simple, because vsock is not really an ethernet device, but simply a socket. We can start from sharing code. Yep, I agree, and maybe the mergeable buffer is a good starting point for sharing code! @Jiang, do you want to take a look at this possibility? Thanks, Stefano
Re: [PATCH 00/19] Microsoft Hypervisor root partition ioctl interface
Nuno Das Neves writes: > This patch series provides a userspace interface for creating and running > guest > virtual machines while running on the Microsoft Hypervisor [0]. > > Since managing guest machines can only be done when Linux is the root > partition, > this series depends on Wei Liu's patch series merged in 5.12: > https://lore.kernel.org/linux-hyperv/20210203150435.27941-1-wei@kernel.org/ > > The first two patches provide some helpers for converting hypervisor status > codes to linux error codes, and printing hypervisor status codes to dmesg for > debugging. > > Hyper-V related headers asm-generic/hyperv-tlfs.h and x86/asm/hyperv-tlfs.h > are > split into uapi and non-uapi. The uapi versions contain structures used in > both > the ioctl interface and the kernel. > > The mshv API is introduced in drivers/hv/mshv_main.c. As each interface is > introduced, documentation is added in Documentation/virt/mshv/api.rst. > The API is file-descriptor based, like KVM. The entry point is /dev/mshv. > > /dev/mshv ioctls: > MSHV_CHECK_EXTENSION > MSHV_CREATE_PARTITION > > Partition (vm) ioctls: > MSHV_MAP_GUEST_MEMORY, MSHV_UNMAP_GUEST_MEMORY > MSHV_INSTALL_INTERCEPT > MSHV_ASSERT_INTERRUPT > MSHV_GET_PARTITION_PROPERTY, MSHV_SET_PARTITION_PROPERTY > MSHV_CREATE_VP > > Vp (vcpu) ioctls: > MSHV_GET_VP_REGISTERS, MSHV_SET_VP_REGISTERS > MSHV_RUN_VP > MSHV_GET_VP_STATE, MSHV_SET_VP_STATE > MSHV_TRANSLATE_GVA > mmap() (register page) > > [0] Hyper-V is more well-known, but it really refers to the whole stack > including the hypervisor and other components that run in Windows kernel > and userspace. > > Changes since RFC: > 1. Moved code from virt/mshv to drivers/hv > 2. Split hypercall helper functions and synic code to hv_call.c and hv_synic.c > 3. MSHV_REQUEST_VERSION ioctl replaced with MSHV_CHECK_EXTENSION > 3. Numerous suggestions, fixes, style changes, etc. from Michael Kelley, Vitaly >Kuznetsov, Wei Liu, and Vineeth Pillai > 4.
Added patch to enable hypervisor enlightenments on partition creation > 5. Added Wei Liu's patch for GVA to GPA translation > Thank you for addressing these! One nitpick though: could you please run your next submission through 'scripts/checkpatch.pl'? It reports a number of issues here, mostly minor but still worth addressing, i.e.: $ scripts/checkpatch.pl *.patch ... --- 0002-asm-generic-hyperv-convert-hyperv-statuses-to-string.patch --- ERROR: Macros with complex values should be enclosed in parentheses #95: FILE: include/asm-generic/hyperv-tlfs.h:192: +#define __HV_STATUS_DEF(OP) \ + OP(HV_STATUS_SUCCESS, 0x0) \ ... ERROR: Macros with complex values should be enclosed in parentheses #119: FILE: include/asm-generic/hyperv-tlfs.h:216: +#define __HV_MAKE_HV_STATUS_ENUM(NAME, VAL) NAME = (VAL), ERROR: Macros with multiple statements should be enclosed in a do - while loop #120: FILE: include/asm-generic/hyperv-tlfs.h:217: +#define __HV_MAKE_HV_STATUS_CASE(NAME, VAL) case (NAME): return (#NAME); WARNING: Macros with flow control statements should be avoided #120: FILE: include/asm-generic/hyperv-tlfs.h:217: +#define __HV_MAKE_HV_STATUS_CASE(NAME, VAL) case (NAME): return (#NAME); WARNING: macros should not use a trailing semicolon #120: FILE: include/asm-generic/hyperv-tlfs.h:217: +#define __HV_MAKE_HV_STATUS_CASE(NAME, VAL) case (NAME): return (#NAME); total: 3 errors, 2 warnings, 108 lines checked NOTE: For some of the reported defects, checkpatch may be able to mechanically convert to the typical style using --fix or --fix-inplace. 0002-asm-generic-hyperv-convert-hyperv-statuses-to-string.patch has style problems, please review. ... --- 0004-drivers-hv-check-extension-ioctl.patch --- WARNING: added, moved or deleted file(s), does MAINTAINERS need updating? #36: new file mode 100644 WARNING: ENOTSUPP is not a SUSV4 error code, prefer EOPNOTSUPP #131: FILE: drivers/hv/mshv_main.c:52: + return -ENOTSUPP; total: 0 errors, 2 warnings, 137 lines checked ... 
WARNING: Improper SPDX comment style for 'drivers/hv/hv_call.c', please use '//' instead #94: FILE: drivers/hv/hv_call.c:1: +/* SPDX-License-Identifier: GPL-2.0-only */ WARNING: Missing or malformed SPDX-License-Identifier tag in line 1 #94: FILE: drivers/hv/hv_call.c:1: +/* SPDX-License-Identifier: GPL-2.0-only */ ERROR: "(foo*)" should be "(foo *)" #178: FILE: drivers/hv/hv_call.c:85: + *(u64*)&input); ERROR: "(foo*)" should be "(foo *)" #201: FILE: drivers/hv/hv_call.c:108: + *(u64*)&input); ERROR: "(foo*)" should be "(foo *)" #215: FILE: drivers/hv/hv_call.c:122: + status = hv_do_fast_hypercall8(HVCALL_DELETE_PARTITION, *(u64*)&input); total: 3 errors, 3 warnings, 330 lines checked ...
[PATCH v4 3/6] x86/sev-es: Split up runtime #VC handler for correct state tracking
From: Joerg Roedel Split up the #VC handler code into a from-user and a from-kernel part. This allows clean and correct state tracking, as the #VC handler needs to enter NMI-state when raised from kernel mode and plain IRQ state when raised from user-mode. Fixes: 62441a1fb532 ("x86/sev-es: Correctly track IRQ states in runtime #VC handler") Suggested-by: Peter Zijlstra Signed-off-by: Joerg Roedel --- arch/x86/kernel/sev.c | 118 -- 1 file changed, 68 insertions(+), 50 deletions(-) diff --git a/arch/x86/kernel/sev.c b/arch/x86/kernel/sev.c index 2a922d1b03c8..475bbc1b3547 100644 --- a/arch/x86/kernel/sev.c +++ b/arch/x86/kernel/sev.c @@ -1326,43 +1326,14 @@ static __always_inline bool on_vc_fallback_stack(struct pt_regs *regs) return (sp >= __this_cpu_ist_bottom_va(VC2) && sp < __this_cpu_ist_top_va(VC2)); } -/* - * Main #VC exception handler. It is called when the entry code was able to - * switch off the IST to a safe kernel stack. - * - * With the current implementation it is always possible to switch to a safe - * stack because #VC exceptions only happen at known places, like intercepted - * instructions or accesses to MMIO areas/IO ports. They can also happen with - * code instrumentation when the hypervisor intercepts #DB, but the critical - * paths are forbidden to be instrumented, so #DB exceptions currently also - * only happen in safe places. - */ -DEFINE_IDTENTRY_VC_SAFE_STACK(exc_vmm_communication) +static bool vc_raw_handle_exception(struct pt_regs *regs, unsigned long error_code) { - irqentry_state_t irq_state; struct ghcb_state state; struct es_em_ctxt ctxt; enum es_result result; unsigned long flags; struct ghcb *ghcb; - - /* -* Handle #DB before calling into !noinstr code to avoid recursive #DB. 
-*/ - if (error_code == SVM_EXIT_EXCP_BASE + X86_TRAP_DB) { - vc_handle_trap_db(regs); - return; - } - - irq_state = irqentry_nmi_enter(regs); - lockdep_assert_irqs_disabled(); - instrumentation_begin(); - - /* -* This is invoked through an interrupt gate, so IRQs are disabled. The -* code below might walk page-tables for user or kernel addresses, so -* keep the IRQs disabled to protect us against concurrent TLB flushes. -*/ + bool ret = true; ghcb = sev_es_get_ghcb(&state, &flags); @@ -1382,15 +1353,18 @@ DEFINE_IDTENTRY_VC_SAFE_STACK(exc_vmm_communication) case ES_UNSUPPORTED: pr_err_ratelimited("Unsupported exit-code 0x%02lx in #VC exception (IP: 0x%lx)\n", error_code, regs->ip); - goto fail; + ret = false; + break; case ES_VMM_ERROR: pr_err_ratelimited("Failure in communication with VMM (exit-code 0x%02lx IP: 0x%lx)\n", error_code, regs->ip); - goto fail; + ret = false; + break; case ES_DECODE_FAILED: pr_err_ratelimited("Failed to decode instruction (exit-code 0x%02lx IP: 0x%lx)\n", error_code, regs->ip); - goto fail; + ret = false; + break; case ES_EXCEPTION: vc_forward_exception(&ctxt); break; @@ -1406,24 +1380,16 @@ DEFINE_IDTENTRY_VC_SAFE_STACK(exc_vmm_communication) BUG(); } -out: - instrumentation_end(); - irqentry_nmi_exit(regs, irq_state); + return ret; +} - return; +static void vc_handle_from_kernel(struct pt_regs *regs, unsigned long error_code) +{ + irqentry_state_t irq_state = irqentry_nmi_enter(regs); -fail: - if (user_mode(regs)) { - /* -* Do not kill the machine if user-space triggered the -* exception. Send SIGBUS instead and let user-space deal with -* it. 
-*/ - force_sig_fault(SIGBUS, BUS_OBJERR, (void __user *)0); - } else { - pr_emerg("PANIC: Unhandled #VC exception in kernel space (result=%d)\n", -result); + instrumentation_begin(); + if (!vc_raw_handle_exception(regs, error_code)) { /* Show some debug info */ show_regs(regs); @@ -1434,7 +1400,59 @@ DEFINE_IDTENTRY_VC_SAFE_STACK(exc_vmm_communication) panic("Returned from Terminate-Request to Hypervisor\n"); } - goto out; + instrumentation_end(); + irqentry_nmi_exit(regs, irq_state); +} + +static void vc_handle_from_user(struct pt_regs *regs, unsigned long error_code) +{ + irqentry_state_t irq_state = irqentry_enter(regs); + + instrumentation_begin(); + + if (!vc_raw_handle_exception(regs, error_code)) { + /* +
[PATCH v4 4/6] x86/insn-eval: Make 0 a valid RIP for insn_get_effective_ip()
From: Joerg Roedel In theory 0 is a valid value for the instruction pointer, so don't use it as the error return value from insn_get_effective_ip(). Signed-off-by: Joerg Roedel --- arch/x86/lib/insn-eval.c | 14 +++--- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/arch/x86/lib/insn-eval.c b/arch/x86/lib/insn-eval.c index a67afd74232c..4eecb9c7c6a0 100644 --- a/arch/x86/lib/insn-eval.c +++ b/arch/x86/lib/insn-eval.c @@ -1417,7 +1417,7 @@ void __user *insn_get_addr_ref(struct insn *insn, struct pt_regs *regs) } } -static unsigned long insn_get_effective_ip(struct pt_regs *regs) +static int insn_get_effective_ip(struct pt_regs *regs, unsigned long *ip) { unsigned long seg_base = 0; @@ -1430,10 +1430,12 @@ static unsigned long insn_get_effective_ip(struct pt_regs *regs) if (!user_64bit_mode(regs)) { seg_base = insn_get_seg_base(regs, INAT_SEG_REG_CS); if (seg_base == -1L) - return 0; + return -EINVAL; } - return seg_base + regs->ip; + *ip = seg_base + regs->ip; + + return 0; } /** @@ -1455,8 +1457,7 @@ int insn_fetch_from_user(struct pt_regs *regs, unsigned char buf[MAX_INSN_SIZE]) unsigned long ip; int not_copied; - ip = insn_get_effective_ip(regs); - if (!ip) + if (insn_get_effective_ip(regs, &ip)) return 0; not_copied = copy_from_user(buf, (void __user *)ip, MAX_INSN_SIZE); @@ -1484,8 +1485,7 @@ int insn_fetch_from_user_inatomic(struct pt_regs *regs, unsigned char buf[MAX_IN unsigned long ip; int not_copied; - ip = insn_get_effective_ip(regs); - if (!ip) + if (insn_get_effective_ip(regs, &ip)) return 0; not_copied = __copy_from_user_inatomic(buf, (void __user *)ip, MAX_INSN_SIZE); -- 2.31.1 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH v4 5/6] x86/insn: Extend error reporting from insn_fetch_from_user[_inatomic]()
From: Joerg Roedel The error reporting from the insn_fetch_from_user*() functions is not very verbose. Extend it to include information on whether the linear RIP could not be calculated or whether the memory access faulted. This will be used in the SEV-ES code to propagate the correct exception depending on what went wrong during instruction fetch. Signed-off-by: Joerg Roedel --- arch/x86/kernel/sev.c| 8 arch/x86/kernel/umip.c | 10 -- arch/x86/lib/insn-eval.c | 8 ++-- 3 files changed, 14 insertions(+), 12 deletions(-) diff --git a/arch/x86/kernel/sev.c b/arch/x86/kernel/sev.c index 475bbc1b3547..a7045a5a94ca 100644 --- a/arch/x86/kernel/sev.c +++ b/arch/x86/kernel/sev.c @@ -267,17 +267,17 @@ static int vc_fetch_insn_kernel(struct es_em_ctxt *ctxt, static enum es_result __vc_decode_user_insn(struct es_em_ctxt *ctxt) { char buffer[MAX_INSN_SIZE]; - int res; + int insn_bytes; - res = insn_fetch_from_user_inatomic(ctxt->regs, buffer); - if (!res) { + insn_bytes = insn_fetch_from_user_inatomic(ctxt->regs, buffer); + if (insn_bytes <= 0) { ctxt->fi.vector = X86_TRAP_PF; ctxt->fi.error_code = X86_PF_INSTR | X86_PF_USER; ctxt->fi.cr2= ctxt->regs->ip; return ES_EXCEPTION; } - if (!insn_decode_from_regs(&ctxt->insn, ctxt->regs, buffer, res)) + if (!insn_decode_from_regs(&ctxt->insn, ctxt->regs, buffer, insn_bytes)) return ES_DECODE_FAILED; if (ctxt->insn.immediate.got) diff --git a/arch/x86/kernel/umip.c b/arch/x86/kernel/umip.c index 8daa70b0d2da..337178809c89 100644 --- a/arch/x86/kernel/umip.c +++ b/arch/x86/kernel/umip.c @@ -346,14 +346,12 @@ bool fixup_umip_exception(struct pt_regs *regs) if (!regs) return false; - nr_copied = insn_fetch_from_user(regs, buf); - /* -* The insn_fetch_from_user above could have failed if user code -* is protected by a memory protection key. Give up on emulation -* in such a case. Should we issue a page fault? +* Give up on emulation if fetching the instruction failed. Should we +* issue a page fault or a #GP? 
*/ - if (!nr_copied) + nr_copied = insn_fetch_from_user(regs, buf); + if (nr_copied <= 0) return false; if (!insn_decode_from_regs(&insn, regs, buf, nr_copied)) diff --git a/arch/x86/lib/insn-eval.c b/arch/x86/lib/insn-eval.c index 4eecb9c7c6a0..1b5cdf8b7a4e 100644 --- a/arch/x86/lib/insn-eval.c +++ b/arch/x86/lib/insn-eval.c @@ -1451,6 +1451,8 @@ static int insn_get_effective_ip(struct pt_regs *regs, unsigned long *ip) * Number of instruction bytes copied. * * 0 if nothing was copied. + * + * -EINVAL if the linear address of the instruction could not be calculated */ int insn_fetch_from_user(struct pt_regs *regs, unsigned char buf[MAX_INSN_SIZE]) { @@ -1458,7 +1460,7 @@ int insn_fetch_from_user(struct pt_regs *regs, unsigned char buf[MAX_INSN_SIZE]) int not_copied; if (insn_get_effective_ip(regs, &ip)) - return 0; + return -EINVAL; not_copied = copy_from_user(buf, (void __user *)ip, MAX_INSN_SIZE); @@ -1479,6 +1481,8 @@ int insn_fetch_from_user(struct pt_regs *regs, unsigned char buf[MAX_INSN_SIZE]) * Number of instruction bytes copied. * * 0 if nothing was copied. + * + * -EINVAL if the linear address of the instruction could not be calculated */ int insn_fetch_from_user_inatomic(struct pt_regs *regs, unsigned char buf[MAX_INSN_SIZE]) { @@ -1486,7 +1490,7 @@ int insn_fetch_from_user_inatomic(struct pt_regs *regs, unsigned char buf[MAX_IN int not_copied; if (insn_get_effective_ip(regs, &ip)) - return 0; + return -EINVAL; not_copied = __copy_from_user_inatomic(buf, (void __user *)ip, MAX_INSN_SIZE); -- 2.31.1 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH v4 6/6] x86/sev-es: Propagate #GP if getting linear instruction address failed
From: Joerg Roedel When an instruction is fetched from user-space, segmentation needs to be taken into account. This means that getting the linear address of an instruction can fail. Hardware would raise a #GP exception in that case, but the #VC exception handler would emulate it as a page-fault. The insn_fetch_from_user*() functions now provide the relevant information in case of an failure. Use that and propagate a #GP when the linear address of an instruction to fetch could not be calculated. Signed-off-by: Joerg Roedel --- arch/x86/kernel/sev.c | 9 - 1 file changed, 8 insertions(+), 1 deletion(-) diff --git a/arch/x86/kernel/sev.c b/arch/x86/kernel/sev.c index a7045a5a94ca..80c0d8385def 100644 --- a/arch/x86/kernel/sev.c +++ b/arch/x86/kernel/sev.c @@ -270,11 +270,18 @@ static enum es_result __vc_decode_user_insn(struct es_em_ctxt *ctxt) int insn_bytes; insn_bytes = insn_fetch_from_user_inatomic(ctxt->regs, buffer); - if (insn_bytes <= 0) { + if (insn_bytes == 0) { + /* Nothing could be copied */ ctxt->fi.vector = X86_TRAP_PF; ctxt->fi.error_code = X86_PF_INSTR | X86_PF_USER; ctxt->fi.cr2= ctxt->regs->ip; return ES_EXCEPTION; + } else if (insn_bytes == -EINVAL) { + /* Effective RIP could not be calculated */ + ctxt->fi.vector = X86_TRAP_GP; + ctxt->fi.error_code = 0; + ctxt->fi.cr2= 0; + return ES_EXCEPTION; } if (!insn_decode_from_regs(&ctxt->insn, ctxt->regs, buffer, insn_bytes)) -- 2.31.1 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH v4 2/6] x86/sev-es: Disable IRQs while GHCB is active
From: Joerg Roedel The #VC handler only cares about IRQs being disabled while the GHCB is active, as it must not be interrupted by something which could cause another #VC while it holds the GHCB (NMI is the exception for which the backup GHCB is there). Make sure nothing interrupts the code path while the GHCB is active by disabling IRQs in sev_es_get_ghcb() and restoring the previous irq state in sev_es_put_ghcb(). Signed-off-by: Joerg Roedel --- arch/x86/kernel/sev.c | 39 +-- 1 file changed, 25 insertions(+), 14 deletions(-) diff --git a/arch/x86/kernel/sev.c b/arch/x86/kernel/sev.c index 4fd997bbf059..2a922d1b03c8 100644 --- a/arch/x86/kernel/sev.c +++ b/arch/x86/kernel/sev.c @@ -192,14 +192,23 @@ void noinstr __sev_es_ist_exit(void) this_cpu_write(cpu_tss_rw.x86_tss.ist[IST_INDEX_VC], *(unsigned long *)ist); } -static __always_inline struct ghcb *sev_es_get_ghcb(struct ghcb_state *state) +static __always_inline struct ghcb *sev_es_get_ghcb(struct ghcb_state *state, + unsigned long *flags) { struct sev_es_runtime_data *data; struct ghcb *ghcb; + /* +* Nothing shall interrupt this code path while holding the per-cpu +* GHCB. The backup GHCB is only for NMIs interrupting this path. 
+*/ + local_irq_save(*flags); + data = this_cpu_read(runtime_data); ghcb = &data->ghcb_page; + + if (unlikely(data->ghcb_active)) { /* GHCB is already in use - save its contents */ @@ -479,7 +488,8 @@ static enum es_result vc_slow_virt_to_phys(struct ghcb *ghcb, struct es_em_ctxt /* Include code shared with pre-decompression boot stage */ #include "sev-shared.c" -static __always_inline void sev_es_put_ghcb(struct ghcb_state *state) +static __always_inline void sev_es_put_ghcb(struct ghcb_state *state, + unsigned long flags) { struct sev_es_runtime_data *data; struct ghcb *ghcb; @@ -500,14 +510,17 @@ static __always_inline void sev_es_put_ghcb(struct ghcb_state *state) vc_ghcb_invalidate(ghcb); data->ghcb_active = false; } + + local_irq_restore(flags); } void noinstr __sev_es_nmi_complete(void) { struct ghcb_state state; + unsigned long flags; struct ghcb *ghcb; - ghcb = sev_es_get_ghcb(&state); + ghcb = sev_es_get_ghcb(&state, &flags); vc_ghcb_invalidate(ghcb); ghcb_set_sw_exit_code(ghcb, SVM_VMGEXIT_NMI_COMPLETE); @@ -517,7 +530,7 @@ void noinstr __sev_es_nmi_complete(void) sev_es_wr_ghcb_msr(__pa_nodebug(ghcb)); VMGEXIT(); - sev_es_put_ghcb(&state); + sev_es_put_ghcb(&state, flags); } static u64 get_jump_table_addr(void) @@ -527,9 +540,7 @@ static u64 get_jump_table_addr(void) struct ghcb *ghcb; u64 ret = 0; - local_irq_save(flags); - - ghcb = sev_es_get_ghcb(&state); + ghcb = sev_es_get_ghcb(&state, &flags); vc_ghcb_invalidate(ghcb); ghcb_set_sw_exit_code(ghcb, SVM_VMGEXIT_AP_JUMP_TABLE); @@ -543,9 +554,7 @@ static u64 get_jump_table_addr(void) ghcb_sw_exit_info_2_is_valid(ghcb)) ret = ghcb->save.sw_exit_info_2; - sev_es_put_ghcb(&state); - - local_irq_restore(flags); + sev_es_put_ghcb(&state, flags); return ret; } @@ -666,9 +675,10 @@ static bool __init sev_es_setup_ghcb(void) static void sev_es_ap_hlt_loop(void) { struct ghcb_state state; + unsigned long flags; struct ghcb *ghcb; - ghcb = sev_es_get_ghcb(&state); + ghcb = sev_es_get_ghcb(&state, &flags); while 
(true) { vc_ghcb_invalidate(ghcb); @@ -685,7 +695,7 @@ static void sev_es_ap_hlt_loop(void) break; } - sev_es_put_ghcb(&state); + sev_es_put_ghcb(&state, flags); } /* @@ -1333,6 +1343,7 @@ DEFINE_IDTENTRY_VC_SAFE_STACK(exc_vmm_communication) struct ghcb_state state; struct es_em_ctxt ctxt; enum es_result result; + unsigned long flags; struct ghcb *ghcb; /* @@ -1353,7 +1364,7 @@ DEFINE_IDTENTRY_VC_SAFE_STACK(exc_vmm_communication) * keep the IRQs disabled to protect us against concurrent TLB flushes. */ - ghcb = sev_es_get_ghcb(&state); + ghcb = sev_es_get_ghcb(&state, &flags); vc_ghcb_invalidate(ghcb); result = vc_init_em_ctxt(&ctxt, regs, error_code); @@ -1361,7 +1372,7 @@ DEFINE_IDTENTRY_VC_SAFE_STACK(exc_vmm_communication) if (result == ES_OK) result = vc_handle_exitcode(&ctxt, ghcb, error_code); - sev_es_put_ghcb(&state); + sev_es_put_ghcb(&state, flags); /* Done - now check the result */ switch (result) { -- 2.31.1 ___ Virtualization mailing
[PATCH v4 0/6] x86/sev-es: Fixes for SEV-ES Guest Support
From: Joerg Roedel Hi, here is the next revision of my pending fixes for Linux' SEV-ES support. Changes to the previous version are: - Removed first patch which is now in tip/x86/urgent already - Removed patch "x86/sev-es: Run #VC handler in plain IRQ state" and replaced it with "x86/sev-es: Split up runtime #VC handler for correct state tracking" as per suggestion from PeterZ Changes are based on tip/x86/urgent. Please review. Thanks, Joerg Joerg Roedel (6): x86/sev-es: Fix error message in runtime #VC handler x86/sev-es: Disable IRQs while GHCB is active x86/sev-es: Split up runtime #VC handler for correct state tracking x86/insn-eval: Make 0 a valid RIP for insn_get_effective_ip() x86/insn: Extend error reporting from insn_fetch_from_user[_inatomic]() x86/sev-es: Propagate #GP if getting linear instruction address failed arch/x86/kernel/sev.c| 174 +++ arch/x86/kernel/umip.c | 10 +-- arch/x86/lib/insn-eval.c | 22 +++-- 3 files changed, 122 insertions(+), 84 deletions(-) base-commit: efa165504943f2128d50f63de0c02faf6dcceb0d -- 2.31.1
[PATCH v4 1/6] x86/sev-es: Fix error message in runtime #VC handler
From: Joerg Roedel

The runtime #VC handler is not "early" anymore. Fix the copy&paste
error and remove that word from the error message.

Signed-off-by: Joerg Roedel
---
 arch/x86/kernel/sev.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kernel/sev.c b/arch/x86/kernel/sev.c
index 651b81cd648e..4fd997bbf059 100644
--- a/arch/x86/kernel/sev.c
+++ b/arch/x86/kernel/sev.c
@@ -1369,7 +1369,7 @@ DEFINE_IDTENTRY_VC_SAFE_STACK(exc_vmm_communication)
 		vc_finish_insn(&ctxt);
 		break;
 	case ES_UNSUPPORTED:
-		pr_err_ratelimited("Unsupported exit-code 0x%02lx in early #VC exception (IP: 0x%lx)\n",
+		pr_err_ratelimited("Unsupported exit-code 0x%02lx in #VC exception (IP: 0x%lx)\n",
 				   error_code, regs->ip);
 		goto fail;
 	case ES_VMM_ERROR:
-- 
2.31.1
[PATCH net-next v5 11/15] virtio-net: move to virtio_net.h
Move some structure definitions and inline functions into the virtio_net.h file. Signed-off-by: Xuan Zhuo --- drivers/net/virtio/virtio_net.c | 225 +-- drivers/net/virtio/virtio_net.h | 230 2 files changed, 232 insertions(+), 223 deletions(-) create mode 100644 drivers/net/virtio/virtio_net.h diff --git a/drivers/net/virtio/virtio_net.c b/drivers/net/virtio/virtio_net.c index 953739860563..395ec1f18331 100644 --- a/drivers/net/virtio/virtio_net.c +++ b/drivers/net/virtio/virtio_net.c @@ -4,24 +4,8 @@ * Copyright 2007 Rusty Russell IBM Corporation */ //#define DEBUG -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include + +#include "virtio_net.h" static int napi_weight = NAPI_POLL_WEIGHT; module_param(napi_weight, int, 0444); @@ -44,15 +28,6 @@ module_param(napi_tx, bool, 0644); #define VIRTIO_XDP_TX BIT(0) #define VIRTIO_XDP_REDIR BIT(1) -#define VIRTIO_XDP_FLAGBIT(0) - -/* RX packet size EWMA. The average packet size is used to determine the packet - * buffer size when refilling RX rings. As the entire RX ring may be refilled - * at once, the weight is chosen so that the EWMA will be insensitive to short- - * term, transient changes in packet size. 
- */ -DECLARE_EWMA(pkt_len, 0, 64) - #define VIRTNET_DRIVER_VERSION "1.0.0" static const unsigned long guest_offloads[] = { @@ -68,35 +43,6 @@ static const unsigned long guest_offloads[] = { (1ULL << VIRTIO_NET_F_GUEST_ECN) | \ (1ULL << VIRTIO_NET_F_GUEST_UFO)) -struct virtnet_stat_desc { - char desc[ETH_GSTRING_LEN]; - size_t offset; -}; - -struct virtnet_sq_stats { - struct u64_stats_sync syncp; - u64 packets; - u64 bytes; - u64 xdp_tx; - u64 xdp_tx_drops; - u64 kicks; -}; - -struct virtnet_rq_stats { - struct u64_stats_sync syncp; - u64 packets; - u64 bytes; - u64 drops; - u64 xdp_packets; - u64 xdp_tx; - u64 xdp_redirects; - u64 xdp_drops; - u64 kicks; -}; - -#define VIRTNET_SQ_STAT(m) offsetof(struct virtnet_sq_stats, m) -#define VIRTNET_RQ_STAT(m) offsetof(struct virtnet_rq_stats, m) - static const struct virtnet_stat_desc virtnet_sq_stats_desc[] = { { "packets",VIRTNET_SQ_STAT(packets) }, { "bytes", VIRTNET_SQ_STAT(bytes) }, @@ -119,54 +65,6 @@ static const struct virtnet_stat_desc virtnet_rq_stats_desc[] = { #define VIRTNET_SQ_STATS_LEN ARRAY_SIZE(virtnet_sq_stats_desc) #define VIRTNET_RQ_STATS_LEN ARRAY_SIZE(virtnet_rq_stats_desc) -/* Internal representation of a send virtqueue */ -struct send_queue { - /* Virtqueue associated with this send _queue */ - struct virtqueue *vq; - - /* TX: fragments + linear part + virtio header */ - struct scatterlist sg[MAX_SKB_FRAGS + 2]; - - /* Name of the send queue: output.$index */ - char name[40]; - - struct virtnet_sq_stats stats; - - struct napi_struct napi; -}; - -/* Internal representation of a receive virtqueue */ -struct receive_queue { - /* Virtqueue associated with this receive_queue */ - struct virtqueue *vq; - - struct napi_struct napi; - - struct bpf_prog __rcu *xdp_prog; - - struct virtnet_rq_stats stats; - - /* Chain pages by the private ptr. */ - struct page *pages; - - /* Average packet length for mergeable receive buffers. 
*/ - struct ewma_pkt_len mrg_avg_pkt_len; - - /* Page frag for packet buffer allocation. */ - struct page_frag alloc_frag; - - /* RX: fragments + linear part + virtio header */ - struct scatterlist sg[MAX_SKB_FRAGS + 2]; - - /* Min single buffer size for mergeable buffers case. */ - unsigned int min_buf_len; - - /* Name of this receive queue: input.$index */ - char name[40]; - - struct xdp_rxq_info xdp_rxq; -}; - /* Control VQ buffers: protected by the rtnl lock */ struct control_buf { struct virtio_net_ctrl_hdr hdr; @@ -178,67 +76,6 @@ struct control_buf { __virtio64 offloads; }; -struct virtnet_info { - struct virtio_device *vdev; - struct virtqueue *cvq; - struct net_device *dev; - struct send_queue *sq; - struct receive_queue *rq; - unsigned int status; - - /* Max # of queue pairs supported by the device */ - u16 max_queue_pairs; - - /* # of queue pairs currently used by the driver */ - u16 curr_queue_pairs; - - /* # of XDP queue pairs currently used by the driver */ - u16 xdp_queue_pairs; - - /* xdp_queue_pairs may be 0, when xdp is already loaded. So add this. */ - bool xdp_enabled; - - /* I like... big packets and I cannot lie! */ - bo
[PATCH net-next v5 03/15] virtio-net: add priv_flags IFF_NOT_USE_DMA_ADDR
virtio-net does not use DMA addresses directly, so set the new
priv_flags bit IFF_NOT_USE_DMA_ADDR.

Signed-off-by: Xuan Zhuo
---
 drivers/net/virtio_net.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 0416a7e00914..6c1233f0ab3e 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -3064,7 +3064,7 @@ static int virtnet_probe(struct virtio_device *vdev)
 	/* Set up network device as normal. */
 	dev->priv_flags |= IFF_UNICAST_FLT | IFF_LIVE_ADDR_CHANGE |
-			   IFF_TX_SKB_NO_LINEAR;
+			   IFF_TX_SKB_NO_LINEAR | IFF_NOT_USE_DMA_ADDR;
 	dev->netdev_ops = &virtnet_netdev;
 	dev->features = NETIF_F_HIGHDMA;
-- 
2.31.0
[PATCH net-next v5 07/15] virtio-net: standalone virtnet_alloc_frag function
This logic is used by small and merge when adding buf, and the subsequent patch will also use this logic, so it is separated as an independent function. Signed-off-by: Xuan Zhuo --- drivers/net/virtio_net.c | 29 - 1 file changed, 20 insertions(+), 9 deletions(-) diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c index d791543a8dd8..3fd87bf2b2ad 100644 --- a/drivers/net/virtio_net.c +++ b/drivers/net/virtio_net.c @@ -264,6 +264,22 @@ static struct xdp_frame *ptr_to_xdp(void *ptr) return (struct xdp_frame *)((unsigned long)ptr & ~VIRTIO_XDP_FLAG); } +static char *virtnet_alloc_frag(struct receive_queue *rq, unsigned int len, + int gfp) +{ + struct page_frag *alloc_frag = &rq->alloc_frag; + char *buf; + + if (unlikely(!skb_page_frag_refill(len, alloc_frag, gfp))) + return NULL; + + buf = (char *)page_address(alloc_frag->page) + alloc_frag->offset; + get_page(alloc_frag->page); + alloc_frag->offset += len; + + return buf; +} + static void __free_old_xmit(struct send_queue *sq, bool in_napi, struct virtnet_sq_stats *stats) { @@ -1190,7 +1206,6 @@ static void receive_buf(struct virtnet_info *vi, struct receive_queue *rq, static int add_recvbuf_small(struct virtnet_info *vi, struct receive_queue *rq, gfp_t gfp) { - struct page_frag *alloc_frag = &rq->alloc_frag; char *buf; unsigned int xdp_headroom = virtnet_get_headroom(vi); void *ctx = (void *)(unsigned long)xdp_headroom; @@ -1199,12 +1214,10 @@ static int add_recvbuf_small(struct virtnet_info *vi, struct receive_queue *rq, len = SKB_DATA_ALIGN(len) + SKB_DATA_ALIGN(sizeof(struct skb_shared_info)); - if (unlikely(!skb_page_frag_refill(len, alloc_frag, gfp))) + buf = virtnet_alloc_frag(rq, len, gfp); + if (unlikely(!buf)) return -ENOMEM; - buf = (char *)page_address(alloc_frag->page) + alloc_frag->offset; - get_page(alloc_frag->page); - alloc_frag->offset += len; sg_init_one(rq->sg, buf + VIRTNET_RX_PAD + xdp_headroom, vi->hdr_len + GOOD_PACKET_LEN); err = virtqueue_add_inbuf_ctx(rq->vq, rq->sg, 1, buf, 
ctx, gfp); @@ -1295,13 +1308,11 @@ static int add_recvbuf_mergeable(struct virtnet_info *vi, * disabled GSO for XDP, it won't be a big issue. */ len = get_mergeable_buf_len(rq, &rq->mrg_avg_pkt_len, room); - if (unlikely(!skb_page_frag_refill(len + room, alloc_frag, gfp))) + buf = virtnet_alloc_frag(rq, len + room, gfp); + if (unlikely(!buf)) return -ENOMEM; - buf = (char *)page_address(alloc_frag->page) + alloc_frag->offset; buf += headroom; /* advance address leaving hole at front of pkt */ - get_page(alloc_frag->page); - alloc_frag->offset += len + room; hole = alloc_frag->size - alloc_frag->offset; if (hole < len + room) { /* To avoid internal fragmentation, if there is very likely not -- 2.31.0 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH net-next v5 15/15] virtio-net: xsk zero copy xmit kick by threshold
After testing, the performance of calling kick for every packet is not
stable, and kicking only once after all packets are sent is also not
good. So add a module parameter that specifies how many packets are
sent before calling a kick. 8 is a relatively stable value with the
best performance.

Here is the pps of the test of xsk_kick_thr under different values
(from 1 to 64):

thr  PPS             thr  PPS             thr  PPS
1    2924116.74247   23   3683263.04348   45   2777907.22963
2    3441010.57191   24   3078880.13043   46   2781376.21739
3    3636728.72378   25   2859219.57656   47   2777271.91304
4    3637518.61468   26   2851557.9593    48   2800320.56575
5    3651738.16251   27   2834783.54408   49   2813039.87599
6    3652176.69231   28   2847012.41472   50   3445143.01839
7    3665415.80602   29   2860633.91304   51   3666918.01281
8    3665045.16555   30   2857903.5786    52   3059929.2709
9    3671023.2401    31   2835589.98963   53   2831515.21739
10   3669532.23274   32   2862827.88706   54   3451804.07204
11   3666160.37749   33   2871855.96696   55   3654975.92385
12   3674951.44813   34   3434456.44816   56   3676198.3188
13   3667447.57331   35   3656918.54177   57   3684740.85619
14   3018846.0503    36   3596921.16722   58   3060958.8594
15   2792773.84505   37   3603460.63903   59   2828874.57191
16   3430596.3602    38   3595410.87666   60   3459926.11027
17   3660525.85806   39   3604250.17819   61   3685444.47599
18   3045627.69054   40   3596542.28428   62   3049959.0809
19   2841542.94177   41   3600705.16054   63   2806280.04013
20   2830475.97348   42   3019833.71191   64   3448494.3913
21   2845655.55789   43   2752951.93264
22   3450389.84365   44   2753107.27164

It can be seen that when the value of xsk_kick_thr is relatively small
the performance is not good, and when its value is greater than 13 the
performance becomes more irregular and unstable. The values from 3 to
13 look similar, so I chose 8 as the default.

The test environment is qemu + vhost-net. I modified vhost-net to drop
the packets sent by the vm directly, so that the cpu of the vm can run
higher.
By default, the processes in the vm and the cpu of softirqd are too low, and there is no obvious difference in the test data. During the test, the cpu of softirq reached 100%. Each xsk_kick_thr was run for 300s, the pps of every second was recorded, and the average of the pps was finally taken. The vhost process cpu on the host has also reached 100%. Signed-off-by: Xuan Zhuo Reviewed-by: Dust Li --- drivers/net/virtio/virtio_net.c | 1 + drivers/net/virtio/xsk.c| 18 -- drivers/net/virtio/xsk.h| 2 ++ 3 files changed, 19 insertions(+), 2 deletions(-) diff --git a/drivers/net/virtio/virtio_net.c b/drivers/net/virtio/virtio_net.c index 9503133e71f0..dfe509939b45 100644 --- a/drivers/net/virtio/virtio_net.c +++ b/drivers/net/virtio/virtio_net.c @@ -14,6 +14,7 @@ static bool csum = true, gso = true, napi_tx = true; module_param(csum, bool, 0444); module_param(gso, bool, 0444); module_param(napi_tx, bool, 0644); +module_param(xsk_kick_thr, int, 0644); /* FIXME: MTU in config. */ #define GOOD_PACKET_LEN (ETH_HLEN + VLAN_HLEN + ETH_DATA_LEN) diff --git a/drivers/net/virtio/xsk.c b/drivers/net/virtio/xsk.c index 3973c82d1ad2..2f3ba6ab4798 100644 --- a/drivers/net/virtio/xsk.c +++ b/drivers/net/virtio/xsk.c @@ -5,6 +5,8 @@ #include "virtio_net.h" +int xsk_kick_thr = 8; + static struct virtio_net_hdr_mrg_rxbuf xsk_hdr; static struct virtnet_xsk_ctx *virtnet_xsk_ctx_get(struct virtnet_xsk_ctx_head *head) @@ -455,6 +457,7 @@ static int virtnet_xsk_xmit_batch(struct send_queue *sq, struct xdp_desc desc; int err, packet = 0; int ret = -EAGAIN; + int need_kick = 0; while (budget-- > 0) { if (sq->vq->num_free < 2 + MAX_SKB_FRAGS) { @@ -475,11 +478,22 @@ static int virtnet_xsk_xmit_batch(struct send_queue *sq, } ++packet; + ++need_kick; + if (need_kick > xsk_kick_thr) { + if (virtqueue_kick_prepare(sq->vq) && + virtqueue_notify(sq->vq)) + ++stats->kicks; + + need_kick = 0; + } } if (packet) { - if (virtqueue_kick_prepare(sq->vq) && virtqueue_notify(sq->vq)) - ++stats->kicks; + if 
(need_kick) { + if (virtqueue_kick_prepare(sq->vq) && + virtqueue_notify(sq->vq)) + ++stats->kicks; + } *done += packet; stats->xdp_tx += packet; diff --git a/drivers/net/virtio/xsk.h b/drivers/net/virtio/xsk.h index fe22cf78d505..4f0f4f9cf23b 100644 --- a/drivers/net/virtio/xsk.h +++ b/drivers/net/virtio/xsk.h @@ -7,6 +7,8 @@ #define VIRTNET_XSK_BUFF_CTX ((void *)(unsigned long)~0L) +extern int xsk_kick_thr; + /* When xsk di
[PATCH net-next v5 13/15] virtio-net: support AF_XDP zc rx
Compared to the case of xsk tx, the case of xsk zc rx is more complicated. When we process the buf received by vq, we may encounter ordinary buffers, or xsk buffers. What makes the situation more complicated is that in the case of mergeable, when num_buffer > 1, we may still encounter the case where xsk buffer is mixed with ordinary buffer. Another thing that makes the situation more complicated is that when we get an xsk buffer from vq, the xsk bound to this xsk buffer may have been unbound. Signed-off-by: Xuan Zhuo --- drivers/net/virtio/virtio_net.c | 238 ++- drivers/net/virtio/virtio_net.h | 27 +++ drivers/net/virtio/xsk.c| 396 +++- drivers/net/virtio/xsk.h| 75 ++ 4 files changed, 678 insertions(+), 58 deletions(-) diff --git a/drivers/net/virtio/virtio_net.c b/drivers/net/virtio/virtio_net.c index 40d7751f1c5f..9503133e71f0 100644 --- a/drivers/net/virtio/virtio_net.c +++ b/drivers/net/virtio/virtio_net.c @@ -125,11 +125,6 @@ static int rxq2vq(int rxq) return rxq * 2; } -static inline struct virtio_net_hdr_mrg_rxbuf *skb_vnet_hdr(struct sk_buff *skb) -{ - return (struct virtio_net_hdr_mrg_rxbuf *)skb->cb; -} - /* * private is used to chain pages for big packets, put the whole * most recent used list in the beginning for reuse @@ -458,6 +453,68 @@ static unsigned int virtnet_get_headroom(struct virtnet_info *vi) return vi->xdp_enabled ? 
VIRTIO_XDP_HEADROOM : 0; } +/* return value: + * 1: XDP_PASS should handle to build skb + * -1: xdp err, should handle to free the buf and return NULL + * 0: buf has been consumed by xdp + */ +int virtnet_run_xdp(struct net_device *dev, + struct bpf_prog *xdp_prog, + struct xdp_buff *xdp, + unsigned int *xdp_xmit, + struct virtnet_rq_stats *stats) +{ + struct xdp_frame *xdpf; + int act, err; + + act = bpf_prog_run_xdp(xdp_prog, xdp); + stats->xdp_packets++; + + switch (act) { + case XDP_PASS: + return 1; + + case XDP_TX: + stats->xdp_tx++; + xdpf = xdp_convert_buff_to_frame(xdp); + if (unlikely(!xdpf)) + goto err_xdp; + err = virtnet_xdp_xmit(dev, 1, &xdpf, 0); + if (unlikely(!err)) { + xdp_return_frame_rx_napi(xdpf); + } else if (unlikely(err < 0)) { + trace_xdp_exception(dev, xdp_prog, act); + goto err_xdp; + } + *xdp_xmit |= VIRTIO_XDP_TX; + return 0; + + case XDP_REDIRECT: + stats->xdp_redirects++; + err = xdp_do_redirect(dev, xdp, xdp_prog); + if (err) + goto err_xdp; + + *xdp_xmit |= VIRTIO_XDP_REDIR; + return 0; + + default: + bpf_warn_invalid_xdp_action(act); + fallthrough; + + case XDP_ABORTED: + trace_xdp_exception(dev, xdp_prog, act); + fallthrough; + + case XDP_DROP: + break; + } + +err_xdp: + stats->xdp_drops++; + return -1; +} + /* We copy the packet for XDP in the following cases: * * 1) Packet is scattered across multiple rx buffers. @@ -491,27 +548,40 @@ static struct page *xdp_linearize_page(struct receive_queue *rq, int tailroom = SKB_DATA_ALIGN(sizeof(struct skb_shared_info)); unsigned int buflen; void *buf; + void *ctx; int off; - buf = virtqueue_get_buf(rq->vq, &buflen); + buf = virtqueue_get_buf_ctx(rq->vq, &buflen, &ctx); if (unlikely(!buf)) goto err_buf; - p = virt_to_head_page(buf); - off = buf - page_address(p); - /* guard against a misconfigured or uncooperative backend that * is sending packet larger than the MTU. 
*/ if ((page_off + buflen + tailroom) > PAGE_SIZE) { - put_page(p); + virtnet_rx_put_buf(buf, ctx); goto err_buf; } - memcpy(page_address(page) + page_off, - page_address(p) + off, buflen); + if (is_xsk_ctx(ctx)) { + struct virtnet_xsk_ctx_rx *xsk_ctx; + + xsk_ctx = (struct virtnet_xsk_ctx_rx *)buf; + + virtnet_xsk_ctx_rx_copy(xsk_ctx, + page_address(page) + page_off, + buflen, true); + + virtnet_xsk_ctx_rx_put(xsk_ctx); + } else { + p = virt_to_head_page(buf); + off = buf -
[PATCH net-next v5 14/15] virtio-net: xsk direct xmit inside xsk wakeup
Calling virtqueue_napi_schedule() in wakeup makes napi run on the
current cpu. If the application is not busy there is no problem, but if
the application itself is busy it causes a lot of scheduling. If the
application is continuously sending packets, the constant scheduling
between the application and napi makes transmission uneven, with an
obvious delay (visible with tcpdump). When a channel is pushed to 100%
(vhost reaches 100%), the cpu where the application runs also reaches
100%.

This patch instead sends a small amount of data directly in wakeup. The
purpose is to trigger the tx interrupt: the tx interrupt is raised on
the cpu of its affinity and in turn starts the napi machinery, so napi
can continue to consume the xsk tx queue. With two cpus running, cpu0
running the application and cpu1 executing napi to consume data, and a
channel pushed to 100% in the same way, the utilization of cpu0 is
12.7% and the utilization of cpu1 is 2.9%.
Signed-off-by: Xuan Zhuo --- drivers/net/virtio/xsk.c | 28 +++- 1 file changed, 23 insertions(+), 5 deletions(-) diff --git a/drivers/net/virtio/xsk.c b/drivers/net/virtio/xsk.c index 36cda2dcf8e7..3973c82d1ad2 100644 --- a/drivers/net/virtio/xsk.c +++ b/drivers/net/virtio/xsk.c @@ -547,6 +547,7 @@ int virtnet_xsk_wakeup(struct net_device *dev, u32 qid, u32 flag) { struct virtnet_info *vi = netdev_priv(dev); struct xsk_buff_pool *pool; + struct netdev_queue *txq; struct send_queue *sq; if (!netif_running(dev)) @@ -559,11 +560,28 @@ int virtnet_xsk_wakeup(struct net_device *dev, u32 qid, u32 flag) rcu_read_lock(); pool = rcu_dereference(sq->xsk.pool); - if (pool) { - local_bh_disable(); - virtqueue_napi_schedule(&sq->napi, sq->vq); - local_bh_enable(); - } + if (!pool) + goto end; + + if (napi_if_scheduled_mark_missed(&sq->napi)) + goto end; + + txq = netdev_get_tx_queue(dev, qid); + + __netif_tx_lock_bh(txq); + + /* Send part of the packet directly to reduce the delay in sending the +* packet, and this can actively trigger the tx interrupts. +* +* If no packet is sent out, the ring of the device is full. In this +* case, we will still get a tx interrupt response. Then we will deal +* with the subsequent packet sending work. +*/ + virtnet_xsk_run(sq, pool, sq->napi.weight, false); + + __netif_tx_unlock_bh(txq); + +end: rcu_read_unlock(); return 0; } -- 2.31.0 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH net-next v5 12/15] virtio-net: support AF_XDP zc tx
AF_XDP (xdp socket, xsk) is a high-performance packet receiving and
sending technology. This patch implements the binding and unbinding
operations between an xsk and a virtio-net queue for xsk zero copy
xmit.

The xsk zero copy xmit depends on tx napi, because the actual sending
of data is done in the tx napi context. If tx napi does not run, the
data in the xsk tx queue will not be sent. So if tx napi is not
enabled, binding an xsk reports an error, and while an xsk is active it
prevents ethtool from modifying tx napi.

When reclaiming a ptr, a new type of ptr is added, distinguished by the
last two bits of the ptr:

	00: skb
	01: xdp frame
	10: xsk xmit ptr

All sent xsk packets share the virtio-net header in xsk_hdr. If xsk
needs to support csum and other features later, consider assigning an
xsk hdr separately for each sent packet.

Unlike other physical network cards, which can reinitialize a channel
when binding an xsk, virtio does not support resetting an individual
channel; only the entire device can be reset, and directly resetting
the entire device does not seem appropriate. So the situation becomes a
bit more complicated: we have to consider how to deal with the buffers
still referenced in the vq after the xsk is unbound.

To handle this, a ring-size array of struct virtnet_xsk_ctx is added
when an xsk is bound. Each xsk buffer added to the vq corresponds to
one ctx, which records the page where the xsk buffer is located and
takes a page reference. When the buffer is recycled, the page reference
is dropped. When the xsk has been unbound and all related xsk buffers
have been recycled, all ctx are released.
Signed-off-by: Xuan Zhuo Reviewed-by: Dust Li --- drivers/net/virtio/Makefile | 1 + drivers/net/virtio/virtio_net.c | 20 +- drivers/net/virtio/virtio_net.h | 37 +++- drivers/net/virtio/xsk.c| 346 drivers/net/virtio/xsk.h| 99 + 5 files changed, 497 insertions(+), 6 deletions(-) create mode 100644 drivers/net/virtio/xsk.c create mode 100644 drivers/net/virtio/xsk.h diff --git a/drivers/net/virtio/Makefile b/drivers/net/virtio/Makefile index ccc80f40f33a..db79d2e7925f 100644 --- a/drivers/net/virtio/Makefile +++ b/drivers/net/virtio/Makefile @@ -4,3 +4,4 @@ # obj-$(CONFIG_VIRTIO_NET) += virtio_net.o +obj-$(CONFIG_VIRTIO_NET) += xsk.o diff --git a/drivers/net/virtio/virtio_net.c b/drivers/net/virtio/virtio_net.c index 395ec1f18331..40d7751f1c5f 100644 --- a/drivers/net/virtio/virtio_net.c +++ b/drivers/net/virtio/virtio_net.c @@ -1423,6 +1423,7 @@ static int virtnet_poll_tx(struct napi_struct *napi, int budget) txq = netdev_get_tx_queue(vi->dev, index); __netif_tx_lock(txq, raw_smp_processor_id()); + work_done += virtnet_poll_xsk(sq, budget); free_old_xmit(sq, true); __netif_tx_unlock(txq); @@ -2133,8 +2134,16 @@ static int virtnet_set_coalesce(struct net_device *dev, if (napi_weight ^ vi->sq[0].napi.weight) { if (dev->flags & IFF_UP) return -EBUSY; - for (i = 0; i < vi->max_queue_pairs; i++) + + for (i = 0; i < vi->max_queue_pairs; i++) { + /* xsk xmit depend on the tx napi. So if xsk is active, +* prevent modifications to tx napi. 
+*/ + if (rtnl_dereference(vi->sq[i].xsk.pool)) + continue; + vi->sq[i].napi.weight = napi_weight; + } } return 0; @@ -2407,6 +2416,8 @@ static int virtnet_xdp(struct net_device *dev, struct netdev_bpf *xdp) switch (xdp->command) { case XDP_SETUP_PROG: return virtnet_xdp_set(dev, xdp->prog, xdp->extack); + case XDP_SETUP_XSK_POOL: + return virtnet_xsk_pool_setup(dev, xdp); default: return -EINVAL; } @@ -2466,6 +2477,7 @@ static const struct net_device_ops virtnet_netdev = { .ndo_vlan_rx_kill_vid = virtnet_vlan_rx_kill_vid, .ndo_bpf= virtnet_xdp, .ndo_xdp_xmit = virtnet_xdp_xmit, + .ndo_xsk_wakeup = virtnet_xsk_wakeup, .ndo_features_check = passthru_features_check, .ndo_get_phys_port_name = virtnet_get_phys_port_name, .ndo_set_features = virtnet_set_features, @@ -2569,10 +2581,12 @@ static void free_unused_bufs(struct virtnet_info *vi) for (i = 0; i < vi->max_queue_pairs; i++) { struct virtqueue *vq = vi->sq[i].vq; while ((buf = virtqueue_detach_unused_buf(vq)) != NULL) { - if (!is_xdp_frame(buf)) + if (is_skb_ptr(buf)) dev_kfree_skb(buf); - else + else if (is_xdp_frame(buf)) xdp_return_frame(ptr_to_xdp(buf)); + else +
[PATCH net-next v5 05/15] virtio: support virtqueue_detach_unused_buf_ctx
Supports returning ctx while recycling unused buf, which helps to release buf in different ways for different bufs. Signed-off-by: Xuan Zhuo --- drivers/virtio/virtio_ring.c | 22 +++--- include/linux/virtio.h | 2 ++ 2 files changed, 17 insertions(+), 7 deletions(-) diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c index 71e16b53e9c1..a3d7ec1c9ea7 100644 --- a/drivers/virtio/virtio_ring.c +++ b/drivers/virtio/virtio_ring.c @@ -815,7 +815,8 @@ static bool virtqueue_enable_cb_delayed_split(struct virtqueue *_vq) return true; } -static void *virtqueue_detach_unused_buf_split(struct virtqueue *_vq) +static void *virtqueue_detach_unused_buf_ctx_split(struct virtqueue *_vq, + void **ctx) { struct vring_virtqueue *vq = to_vvq(_vq); unsigned int i; @@ -828,7 +829,7 @@ static void *virtqueue_detach_unused_buf_split(struct virtqueue *_vq) continue; /* detach_buf_split clears data, so grab it now. */ buf = vq->split.desc_state[i].data; - detach_buf_split(vq, i, NULL); + detach_buf_split(vq, i, ctx); vq->split.avail_idx_shadow--; vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev, vq->split.avail_idx_shadow); @@ -1526,7 +1527,8 @@ static bool virtqueue_enable_cb_delayed_packed(struct virtqueue *_vq) return true; } -static void *virtqueue_detach_unused_buf_packed(struct virtqueue *_vq) +static void *virtqueue_detach_unused_buf_ctx_packed(struct virtqueue *_vq, + void **ctx) { struct vring_virtqueue *vq = to_vvq(_vq); unsigned int i; @@ -1539,7 +1541,7 @@ static void *virtqueue_detach_unused_buf_packed(struct virtqueue *_vq) continue; /* detach_buf clears data, so grab it now. */ buf = vq->packed.desc_state[i].data; - detach_buf_packed(vq, i, NULL); + detach_buf_packed(vq, i, ctx); END_USE(vq); return buf; } @@ -2018,12 +2020,18 @@ EXPORT_SYMBOL_GPL(virtqueue_enable_cb_delayed); * This is not valid on an active queue; it is useful only for device * shutdown. 
*/ -void *virtqueue_detach_unused_buf(struct virtqueue *_vq) +void *virtqueue_detach_unused_buf_ctx(struct virtqueue *_vq, void **ctx) { struct vring_virtqueue *vq = to_vvq(_vq); - return vq->packed_ring ? virtqueue_detach_unused_buf_packed(_vq) : -virtqueue_detach_unused_buf_split(_vq); + return vq->packed_ring ? + virtqueue_detach_unused_buf_ctx_packed(_vq, ctx) : + virtqueue_detach_unused_buf_ctx_split(_vq, ctx); +} + +void *virtqueue_detach_unused_buf(struct virtqueue *_vq) +{ + return virtqueue_detach_unused_buf_ctx(_vq, NULL); } EXPORT_SYMBOL_GPL(virtqueue_detach_unused_buf); diff --git a/include/linux/virtio.h b/include/linux/virtio.h index b1894e0323fa..8aada4d29e04 100644 --- a/include/linux/virtio.h +++ b/include/linux/virtio.h @@ -78,6 +78,8 @@ bool virtqueue_poll(struct virtqueue *vq, unsigned); bool virtqueue_enable_cb_delayed(struct virtqueue *vq); +void *virtqueue_detach_unused_buf_ctx(struct virtqueue *vq, void **ctx); + void *virtqueue_detach_unused_buf(struct virtqueue *vq); unsigned int virtqueue_get_vring_size(struct virtqueue *vq); -- 2.31.0 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH net-next v5 10/15] virtio-net: independent directory
Create a separate directory for virtio-net. AF_XDP support will be added later, and a separate xsk.c file will be added. Signed-off-by: Xuan Zhuo --- MAINTAINERS | 2 +- drivers/net/Kconfig | 8 +--- drivers/net/Makefile | 2 +- drivers/net/virtio/Kconfig| 11 +++ drivers/net/virtio/Makefile | 6 ++ drivers/net/{ => virtio}/virtio_net.c | 0 6 files changed, 20 insertions(+), 9 deletions(-) create mode 100644 drivers/net/virtio/Kconfig create mode 100644 drivers/net/virtio/Makefile rename drivers/net/{ => virtio}/virtio_net.c (100%) diff --git a/MAINTAINERS b/MAINTAINERS index e69c1991ec3b..2041267f19f1 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -19344,7 +19344,7 @@ S: Maintained F: Documentation/devicetree/bindings/virtio/ F: drivers/block/virtio_blk.c F: drivers/crypto/virtio/ -F: drivers/net/virtio_net.c +F: drivers/net/virtio/ F: drivers/vdpa/ F: drivers/virtio/ F: include/linux/vdpa.h diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig index 4da68ba8448f..2297fe4183ae 100644 --- a/drivers/net/Kconfig +++ b/drivers/net/Kconfig @@ -392,13 +392,7 @@ config VETH When one end receives the packet it appears on its pair and vice versa. -config VIRTIO_NET - tristate "Virtio network driver" - depends on VIRTIO - select NET_FAILOVER - help - This is the virtual network driver for virtio. It can be used with - QEMU based VMMs (like KVM or Xen). Say Y or M. 
+source "drivers/net/virtio/Kconfig" config NLMON tristate "Virtual netlink monitoring device" diff --git a/drivers/net/Makefile b/drivers/net/Makefile index 7ffd2d03efaf..c4c7419e0398 100644 --- a/drivers/net/Makefile +++ b/drivers/net/Makefile @@ -28,7 +28,7 @@ obj-$(CONFIG_NET_TEAM) += team/ obj-$(CONFIG_TUN) += tun.o obj-$(CONFIG_TAP) += tap.o obj-$(CONFIG_VETH) += veth.o -obj-$(CONFIG_VIRTIO_NET) += virtio_net.o +obj-$(CONFIG_VIRTIO_NET) += virtio/ obj-$(CONFIG_VXLAN) += vxlan.o obj-$(CONFIG_GENEVE) += geneve.o obj-$(CONFIG_BAREUDP) += bareudp.o diff --git a/drivers/net/virtio/Kconfig b/drivers/net/virtio/Kconfig new file mode 100644 index ..9bc2a2fc6c3e --- /dev/null +++ b/drivers/net/virtio/Kconfig @@ -0,0 +1,11 @@ +# SPDX-License-Identifier: GPL-2.0-only +# +# virtio-net device configuration +# +config VIRTIO_NET + tristate "Virtio network driver" + depends on VIRTIO + select NET_FAILOVER + help + This is the virtual network driver for virtio. It can be used with + QEMU based VMMs (like KVM or Xen). Say Y or M. diff --git a/drivers/net/virtio/Makefile b/drivers/net/virtio/Makefile new file mode 100644 index ..ccc80f40f33a --- /dev/null +++ b/drivers/net/virtio/Makefile @@ -0,0 +1,6 @@ +# SPDX-License-Identifier: GPL-2.0 +# +# Makefile for the virtio network device drivers. +# + +obj-$(CONFIG_VIRTIO_NET) += virtio_net.o diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio/virtio_net.c similarity index 100% rename from drivers/net/virtio_net.c rename to drivers/net/virtio/virtio_net.c -- 2.31.0 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH net-next v5 08/15] virtio-net: split the receive_mergeable function
receive_mergeable() is too complicated, so this function is split here. One is to make the function more readable. On the other hand, the two independent functions will be called separately in subsequent patches. Signed-off-by: Xuan Zhuo --- drivers/net/virtio_net.c | 181 --- 1 file changed, 111 insertions(+), 70 deletions(-) diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c index 3fd87bf2b2ad..989aba600e63 100644 --- a/drivers/net/virtio_net.c +++ b/drivers/net/virtio_net.c @@ -733,6 +733,109 @@ static struct page *xdp_linearize_page(struct receive_queue *rq, return NULL; } +static void merge_drop_follow_bufs(struct net_device *dev, + struct receive_queue *rq, + u16 num_buf, + struct virtnet_rq_stats *stats) +{ + struct page *page; + unsigned int len; + void *buf; + + while (num_buf-- > 1) { + buf = virtqueue_get_buf(rq->vq, &len); + if (unlikely(!buf)) { + pr_debug("%s: rx error: %d buffers missing\n", +dev->name, num_buf); + dev->stats.rx_length_errors++; + break; + } + stats->bytes += len; + page = virt_to_head_page(buf); + put_page(page); + } +} + +static struct sk_buff *merge_receive_follow_bufs(struct net_device *dev, +struct virtnet_info *vi, +struct receive_queue *rq, +struct sk_buff *head_skb, +u16 num_buf, +struct virtnet_rq_stats *stats) +{ + struct sk_buff *curr_skb; + unsigned int truesize; + unsigned int len, num; + struct page *page; + void *buf, *ctx; + int offset; + + curr_skb = head_skb; + num = num_buf; + + while (--num_buf) { + int num_skb_frags; + + buf = virtqueue_get_buf_ctx(rq->vq, &len, &ctx); + if (unlikely(!buf)) { + pr_debug("%s: rx error: %d buffers out of %d missing\n", +dev->name, num_buf, num); + dev->stats.rx_length_errors++; + goto err_buf; + } + + stats->bytes += len; + page = virt_to_head_page(buf); + + truesize = mergeable_ctx_to_truesize(ctx); + if (unlikely(len > truesize)) { + pr_debug("%s: rx error: len %u exceeds truesize %lu\n", +dev->name, len, (unsigned long)ctx); + dev->stats.rx_length_errors++; + goto 
err_skb; + } + + num_skb_frags = skb_shinfo(curr_skb)->nr_frags; + if (unlikely(num_skb_frags == MAX_SKB_FRAGS)) { + struct sk_buff *nskb = alloc_skb(0, GFP_ATOMIC); + + if (unlikely(!nskb)) + goto err_skb; + if (curr_skb == head_skb) + skb_shinfo(curr_skb)->frag_list = nskb; + else + curr_skb->next = nskb; + curr_skb = nskb; + head_skb->truesize += nskb->truesize; + num_skb_frags = 0; + } + if (curr_skb != head_skb) { + head_skb->data_len += len; + head_skb->len += len; + head_skb->truesize += truesize; + } + offset = buf - page_address(page); + if (skb_can_coalesce(curr_skb, num_skb_frags, page, offset)) { + put_page(page); + skb_coalesce_rx_frag(curr_skb, num_skb_frags - 1, +len, truesize); + } else { + skb_add_rx_frag(curr_skb, num_skb_frags, page, + offset, len, truesize); + } + } + + return head_skb; + +err_skb: + put_page(page); + merge_drop_follow_bufs(dev, rq, num_buf, stats); +err_buf: + stats->drops++; + dev_kfree_skb(head_skb); + return NULL; +} + static struct sk_buff *receive_small(struct net_device *dev, struct virtnet_info *vi, struct receive_queue *rq, @@ -909,7 +1012,7 @@ static struct sk_buff *receive_mergeable(struct net_device *dev, u16 num_buf = virtio16_to_cpu(vi->vdev, hdr->num_buffers); struct page *page = virt_to_head_page(buf); int offset = buf - p
[PATCH net-next v5 06/15] virtio-net: unify the code for recycling the xmit ptr
Now there are two types of "skb" and "xdp frame" during recycling old xmit. There are two completely similar and independent implementations. This is inconvenient for the subsequent addition of new types. So extract a function from this piece of code and call this function uniformly to recover old xmit ptr. Rename free_old_xmit_skbs() to free_old_xmit(). Signed-off-by: Xuan Zhuo --- drivers/net/virtio_net.c | 86 ++-- 1 file changed, 38 insertions(+), 48 deletions(-) diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c index 6c1233f0ab3e..d791543a8dd8 100644 --- a/drivers/net/virtio_net.c +++ b/drivers/net/virtio_net.c @@ -264,6 +264,30 @@ static struct xdp_frame *ptr_to_xdp(void *ptr) return (struct xdp_frame *)((unsigned long)ptr & ~VIRTIO_XDP_FLAG); } +static void __free_old_xmit(struct send_queue *sq, bool in_napi, + struct virtnet_sq_stats *stats) +{ + unsigned int len; + void *ptr; + + while ((ptr = virtqueue_get_buf(sq->vq, &len)) != NULL) { + if (!is_xdp_frame(ptr)) { + struct sk_buff *skb = ptr; + + pr_debug("Sent skb %p\n", skb); + + stats->bytes += skb->len; + napi_consume_skb(skb, in_napi); + } else { + struct xdp_frame *frame = ptr_to_xdp(ptr); + + stats->bytes += frame->len; + xdp_return_frame(frame); + } + stats->packets++; + } +} + /* Converting between virtqueue no. and kernel tx/rx queue no. * 0:rx0 1:tx0 2:rx1 3:tx1 ... 2N:rxN 2N+1:txN 2N+2:cvq */ @@ -572,15 +596,12 @@ static int virtnet_xdp_xmit(struct net_device *dev, int n, struct xdp_frame **frames, u32 flags) { struct virtnet_info *vi = netdev_priv(dev); + struct virtnet_sq_stats stats = {}; struct receive_queue *rq = vi->rq; struct bpf_prog *xdp_prog; struct send_queue *sq; - unsigned int len; - int packets = 0; - int bytes = 0; int nxmit = 0; int kicks = 0; - void *ptr; int ret; int i; @@ -599,20 +620,7 @@ static int virtnet_xdp_xmit(struct net_device *dev, } /* Free up any pending old buffers before queueing new ones. 
*/ - while ((ptr = virtqueue_get_buf(sq->vq, &len)) != NULL) { - if (likely(is_xdp_frame(ptr))) { - struct xdp_frame *frame = ptr_to_xdp(ptr); - - bytes += frame->len; - xdp_return_frame(frame); - } else { - struct sk_buff *skb = ptr; - - bytes += skb->len; - napi_consume_skb(skb, false); - } - packets++; - } + __free_old_xmit(sq, false, &stats); for (i = 0; i < n; i++) { struct xdp_frame *xdpf = frames[i]; @@ -629,8 +637,8 @@ static int virtnet_xdp_xmit(struct net_device *dev, } out: u64_stats_update_begin(&sq->stats.syncp); - sq->stats.bytes += bytes; - sq->stats.packets += packets; + sq->stats.bytes += stats.bytes; + sq->stats.packets += stats.packets; sq->stats.xdp_tx += n; sq->stats.xdp_tx_drops += n - nxmit; sq->stats.kicks += kicks; @@ -1459,39 +1467,21 @@ static int virtnet_receive(struct receive_queue *rq, int budget, return stats.packets; } -static void free_old_xmit_skbs(struct send_queue *sq, bool in_napi) +static void free_old_xmit(struct send_queue *sq, bool in_napi) { - unsigned int len; - unsigned int packets = 0; - unsigned int bytes = 0; - void *ptr; + struct virtnet_sq_stats stats = {}; - while ((ptr = virtqueue_get_buf(sq->vq, &len)) != NULL) { - if (likely(!is_xdp_frame(ptr))) { - struct sk_buff *skb = ptr; - - pr_debug("Sent skb %p\n", skb); - - bytes += skb->len; - napi_consume_skb(skb, in_napi); - } else { - struct xdp_frame *frame = ptr_to_xdp(ptr); - - bytes += frame->len; - xdp_return_frame(frame); - } - packets++; - } + __free_old_xmit(sq, in_napi, &stats); /* Avoid overhead when no packets have been processed * happens when called speculatively from start_xmit. */ - if (!packets) + if (!stats.packets) return; u64_stats_update_begin(&sq->stats.syncp); - sq->stats.bytes += bytes; - sq->stats.packets += packets; + sq->stats.bytes += stats.bytes; + sq->stats.packets += stats.packets; u64_stats_update_end(&sq->stats.syncp); } @@ -1516,7 +1506,7 @@ static void virtnet_pol
[PATCH net-next v5 09/15] virtio-net: virtnet_poll_tx support budget check
virtnet_poll_tx() now checks the work done against the budget, like other network drivers. When work < budget, napi_poll() in dev.c exits directly and virtqueue_napi_complete() is called to close napi. If closing napi fails, or there is still data to be processed, virtqueue_napi_complete() makes napi schedule again; this does not conflict with the logic of napi_poll(). When work == budget, virtnet_poll_tx() returns 'work_done' and napi_poll() in dev.c re-adds napi to the queue. The purpose of this patch is to support xsk xmit in virtnet_poll_tx() in a subsequent patch.

Signed-off-by: Xuan Zhuo
Acked-by: Jason Wang
---
 drivers/net/virtio_net.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 989aba600e63..953739860563 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -1634,6 +1634,7 @@ static int virtnet_poll_tx(struct napi_struct *napi, int budget)
 	struct virtnet_info *vi = sq->vq->vdev->priv;
 	unsigned int index = vq2txq(sq->vq);
 	struct netdev_queue *txq;
+	int work_done = 0;
 
 	if (unlikely(is_xdp_raw_buffer_queue(vi, index))) {
 		/* We don't need to enable cb for XDP */
@@ -1646,12 +1647,13 @@ static int virtnet_poll_tx(struct napi_struct *napi, int budget)
 	free_old_xmit(sq, true);
 	__netif_tx_unlock(txq);
 
-	virtqueue_napi_complete(napi, sq->vq, 0);
+	if (work_done < budget)
+		virtqueue_napi_complete(napi, sq->vq, 0);
 
 	if (sq->vq->num_free >= 2 + MAX_SKB_FRAGS)
 		netif_tx_wake_queue(txq);
 
-	return 0;
+	return work_done;
 }
 
 static int xmit_skb(struct send_queue *sq, struct sk_buff *skb)
-- 
2.31.0

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
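The budget contract described in this patch can be sketched outside the kernel. Everything below is illustrative (`struct fake_napi`, `fake_poll_tx()`, the pending-completion count) rather than the real NAPI API; it only shows the rule that a poll callback closes NAPI when it did strictly less work than its budget, and asks to be re-polled otherwise:

```c
#include <stdbool.h>

/* Toy stand-in for a NAPI context; not the kernel's struct napi_struct. */
struct fake_napi {
    bool completed;    /* napi was closed (napi_complete_done analogue)  */
    bool rescheduled;  /* core will poll us again (work_done == budget)  */
};

/* Pretend there are 'pending' tx completions to reap this round. */
static int fake_poll_tx(struct fake_napi *napi, int pending, int budget)
{
    int work_done = pending < budget ? pending : budget;

    /* Close NAPI only when we did strictly less work than allowed;
     * returning work_done == budget tells the core to re-add us. */
    if (work_done < budget)
        napi->completed = true;
    else
        napi->rescheduled = true;

    return work_done;
}
```

With pending < budget the callback completes NAPI; with pending >= budget it returns the full budget and stays scheduled, mirroring the two branches the commit message describes.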
[PATCH net-next v5 00/15] virtio-net: support xdp socket zero copy
XDP socket (xsk) is an excellent kernel-bypass network transmission framework, and its zero-copy feature needs support from the driver. Zero-copy performance is very good; mlx5 and Intel ixgbe already support it. This patch set allows virtio-net to support xsk's zero-copy xmit feature.

Compared with other drivers, the difficulty with virtio-net is that when we bind a channel to xsk, we cannot directly disable/enable the channel. So we have to consider the bufs that were already placed in the vq after the xsk is released by the upper layer. My solution is to add a ctx to each buf placed in the vq to record the page used, and take a reference on the page, so that when the upper xsk is released, these pages are still safe in the vq. We process these bufs when we recycle bufs or receive new data.

The rx case is more complicated, because we may encounter either an xsk buf or a normal buf. Especially in the mergeable case, we may receive multiple bufs, and xsk bufs and normal bufs may be mixed together.

v5: support rx
v4:
 1. add priv_flags IFF_NOT_USE_DMA_ADDR
 2.
more reasonable patch split Xuan Zhuo (15): netdevice: priv_flags extend to 64bit netdevice: add priv_flags IFF_NOT_USE_DMA_ADDR virtio-net: add priv_flags IFF_NOT_USE_DMA_ADDR xsk: XDP_SETUP_XSK_POOL support option IFF_NOT_USE_DMA_ADDR virtio: support virtqueue_detach_unused_buf_ctx virtio-net: unify the code for recycling the xmit ptr virtio-net: standalone virtnet_aloc_frag function virtio-net: split the receive_mergeable function virtio-net: virtnet_poll_tx support budget check virtio-net: independent directory virtio-net: move to virtio_net.h virtio-net: support AF_XDP zc tx virtio-net: support AF_XDP zc rx virtio-net: xsk direct xmit inside xsk wakeup virtio-net: xsk zero copy xmit kick by threshold MAINTAINERS | 2 +- drivers/net/Kconfig | 8 +- drivers/net/Makefile | 2 +- drivers/net/virtio/Kconfig| 11 + drivers/net/virtio/Makefile | 7 + drivers/net/{ => virtio}/virtio_net.c | 670 +++--- drivers/net/virtio/virtio_net.h | 288 ++ drivers/net/virtio/xsk.c | 766 ++ drivers/net/virtio/xsk.h | 176 ++ drivers/virtio/virtio_ring.c | 22 +- include/linux/netdevice.h | 143 ++--- include/linux/virtio.h| 2 + net/8021q/vlanproc.c | 2 +- net/xdp/xsk_buff_pool.c | 2 +- 14 files changed, 1664 insertions(+), 437 deletions(-) create mode 100644 drivers/net/virtio/Kconfig create mode 100644 drivers/net/virtio/Makefile rename drivers/net/{ => virtio}/virtio_net.c (92%) create mode 100644 drivers/net/virtio/virtio_net.h create mode 100644 drivers/net/virtio/xsk.c create mode 100644 drivers/net/virtio/xsk.h -- 2.31.0 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
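The ctx-plus-page-reference idea from the cover letter can be sketched with a toy refcount. The names here (`fake_page`, `fake_vq_buf`, `fake_queue_buf`) are illustrative stand-ins for the kernel's `get_page()`/`put_page()` machinery, just showing why a buf left in the vq keeps its page alive after the xsk pool is torn down:

```c
/* Toy model: each buf queued in the vq holds an extra reference on its
 * page, so the page survives even after the xsk pool that allocated it
 * is released. Names are illustrative, not kernel API. */
struct fake_page {
    int refcount;
};

struct fake_vq_buf {
    struct fake_page *page;   /* ctx recording which page the buf uses */
};

static void fake_queue_buf(struct fake_vq_buf *buf, struct fake_page *page)
{
    page->refcount++;         /* like get_page(): the vq now pins the page */
    buf->page = page;
}

static void fake_recycle_buf(struct fake_vq_buf *buf)
{
    buf->page->refcount--;    /* like put_page(): drop the vq's pin */
    buf->page = (void *)0;
}
```

Releasing the pool drops its own reference, but the page's refcount only reaches zero once the vq buf is recycled, which is exactly the safety property the cover letter relies on.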
[PATCH net-next v5 02/15] netdevice: add priv_flags IFF_NOT_USE_DMA_ADDR
Some devices, such as virtio-net, do not use DMA addresses directly. Upper-level frameworks such as XDP socket need to be aware of this, so add a new priv_flag, IFF_NOT_USE_DMA_ADDR.

Signed-off-by: Xuan Zhuo
---
 include/linux/netdevice.h | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 3202e055b305..2de0a0c455e5 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1596,6 +1596,7 @@ typedef u64 netdev_priv_flags_t;
  * @IFF_LIVE_RENAME_OK: rename is allowed while device is up and running
  * @IFF_TX_SKB_NO_LINEAR: device/driver is capable of xmitting frames with
  *	skb_headlen(skb) == 0 (data starts from frag0)
+ * @IFF_NOT_USE_DMA_ADDR: driver does not use dma addr directly, such as virtio-net
  */
 enum {
 	IFF_802_1Q_VLAN_BIT,
@@ -1630,6 +1631,7 @@ enum {
 	IFF_L3MDEV_RX_HANDLER_BIT,
 	IFF_LIVE_RENAME_OK_BIT,
 	IFF_TX_SKB_NO_LINEAR_BIT,
+	IFF_NOT_USE_DMA_ADDR_BIT,
 };
 
 #define __IFF_BIT(bit)	((netdev_priv_flags_t)1 << (bit))
@@ -1667,6 +1669,7 @@ enum {
 #define IFF_L3MDEV_RX_HANDLER	__IFF(L3MDEV_RX_HANDLER)
 #define IFF_LIVE_RENAME_OK	__IFF(LIVE_RENAME_OK)
 #define IFF_TX_SKB_NO_LINEAR	__IFF(TX_SKB_NO_LINEAR)
+#define IFF_NOT_USE_DMA_ADDR	__IFF(NOT_USE_DMA_ADDR)
 
 /* Specifies the type of the struct net_device::ml_priv pointer */
 enum netdev_ml_priv_type {
-- 
2.31.0
[PATCH net-next v5 04/15] xsk: XDP_SETUP_XSK_POOL support option IFF_NOT_USE_DMA_ADDR
Some devices, such as virtio-net, do not use DMA addresses directly. These devices do not initialize DMA after completing the xsk setup, so skip the DMA check for them.

Signed-off-by: Xuan Zhuo
Reviewed-by: Dust Li
Acked-by: Magnus Karlsson
---
 net/xdp/xsk_buff_pool.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/xdp/xsk_buff_pool.c b/net/xdp/xsk_buff_pool.c
index 8de01aaac4a0..a7e434de0308 100644
--- a/net/xdp/xsk_buff_pool.c
+++ b/net/xdp/xsk_buff_pool.c
@@ -171,7 +171,7 @@ int xp_assign_dev(struct xsk_buff_pool *pool,
 	if (err)
 		goto err_unreg_pool;
 
-	if (!pool->dma_pages) {
+	if (!(netdev->priv_flags & IFF_NOT_USE_DMA_ADDR) && !pool->dma_pages) {
 		WARN(1, "Driver did not DMA map zero-copy buffers");
 		err = -EINVAL;
 		goto err_unreg_xsk;
-- 
2.31.0
[PATCH net-next v5 01/15] netdevice: priv_flags extend to 64bit
The size of priv_flags is 32 bits, and the number of flags currently available has reached 32. It is time to expand the size of priv_flags to 64 bits. Here the priv_flags is modified to 8 bytes, but the size of struct net_device has not changed, it is still 2176 bytes. It is because _tx is aligned based on the cache line. But there is a 4-byte hole left here. Since the fields before and after priv_flags are read mostly, I did not adjust the order of the fields here. Signed-off-by: Xuan Zhuo --- include/linux/netdevice.h | 140 -- net/8021q/vlanproc.c | 2 +- 2 files changed, 73 insertions(+), 69 deletions(-) diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index be1dcceda5e4..3202e055b305 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -1549,9 +1549,9 @@ struct net_device_ops { struct net_device_path *path); }; +typedef u64 netdev_priv_flags_t; + /** - * enum netdev_priv_flags - &struct net_device priv_flags - * * These are the &struct net_device, they are only set internally * by drivers and used in the kernel. 
These flags are invisible to * userspace; this means that the order of these flags can change @@ -1597,73 +1597,76 @@ struct net_device_ops { * @IFF_TX_SKB_NO_LINEAR: device/driver is capable of xmitting frames with * skb_headlen(skb) == 0 (data starts from frag0) */ -enum netdev_priv_flags { - IFF_802_1Q_VLAN = 1<<0, - IFF_EBRIDGE = 1<<1, - IFF_BONDING = 1<<2, - IFF_ISATAP = 1<<3, - IFF_WAN_HDLC= 1<<4, - IFF_XMIT_DST_RELEASE= 1<<5, - IFF_DONT_BRIDGE = 1<<6, - IFF_DISABLE_NETPOLL = 1<<7, - IFF_MACVLAN_PORT= 1<<8, - IFF_BRIDGE_PORT = 1<<9, - IFF_OVS_DATAPATH= 1<<10, - IFF_TX_SKB_SHARING = 1<<11, - IFF_UNICAST_FLT = 1<<12, - IFF_TEAM_PORT = 1<<13, - IFF_SUPP_NOFCS = 1<<14, - IFF_LIVE_ADDR_CHANGE= 1<<15, - IFF_MACVLAN = 1<<16, - IFF_XMIT_DST_RELEASE_PERM = 1<<17, - IFF_L3MDEV_MASTER = 1<<18, - IFF_NO_QUEUE= 1<<19, - IFF_OPENVSWITCH = 1<<20, - IFF_L3MDEV_SLAVE= 1<<21, - IFF_TEAM= 1<<22, - IFF_RXFH_CONFIGURED = 1<<23, - IFF_PHONY_HEADROOM = 1<<24, - IFF_MACSEC = 1<<25, - IFF_NO_RX_HANDLER = 1<<26, - IFF_FAILOVER= 1<<27, - IFF_FAILOVER_SLAVE = 1<<28, - IFF_L3MDEV_RX_HANDLER = 1<<29, - IFF_LIVE_RENAME_OK = 1<<30, - IFF_TX_SKB_NO_LINEAR= 1<<31, +enum { + IFF_802_1Q_VLAN_BIT, + IFF_EBRIDGE_BIT, + IFF_BONDING_BIT, + IFF_ISATAP_BIT, + IFF_WAN_HDLC_BIT, + IFF_XMIT_DST_RELEASE_BIT, + IFF_DONT_BRIDGE_BIT, + IFF_DISABLE_NETPOLL_BIT, + IFF_MACVLAN_PORT_BIT, + IFF_BRIDGE_PORT_BIT, + IFF_OVS_DATAPATH_BIT, + IFF_TX_SKB_SHARING_BIT, + IFF_UNICAST_FLT_BIT, + IFF_TEAM_PORT_BIT, + IFF_SUPP_NOFCS_BIT, + IFF_LIVE_ADDR_CHANGE_BIT, + IFF_MACVLAN_BIT, + IFF_XMIT_DST_RELEASE_PERM_BIT, + IFF_L3MDEV_MASTER_BIT, + IFF_NO_QUEUE_BIT, + IFF_OPENVSWITCH_BIT, + IFF_L3MDEV_SLAVE_BIT, + IFF_TEAM_BIT, + IFF_RXFH_CONFIGURED_BIT, + IFF_PHONY_HEADROOM_BIT, + IFF_MACSEC_BIT, + IFF_NO_RX_HANDLER_BIT, + IFF_FAILOVER_BIT, + IFF_FAILOVER_SLAVE_BIT, + IFF_L3MDEV_RX_HANDLER_BIT, + IFF_LIVE_RENAME_OK_BIT, + IFF_TX_SKB_NO_LINEAR_BIT, }; -#define IFF_802_1Q_VLANIFF_802_1Q_VLAN -#define IFF_EBRIDGEIFF_EBRIDGE -#define 
IFF_BONDINGIFF_BONDING -#define IFF_ISATAP IFF_ISATAP -#define IFF_WAN_HDLC IFF_WAN_HDLC -#define IFF_XMIT_DST_RELEASE IFF_XMIT_DST_RELEASE -#define IFF_DONT_BRIDGEIFF_DONT_BRIDGE -#define IFF_DISABLE_NETPOLLIFF_DISABLE_NETPOLL -#define IFF_MACVLAN_PORT IFF_MACVLAN_PORT -#define IFF_BRIDGE_PORTIFF_BRIDGE_PORT -#define IFF_OVS_DATAPATH IFF_OVS_DATAPATH -#define IFF_TX_SKB_SHARING IFF_TX_SKB_SHARING -#define IFF_UNICAST_FLTIFF_UNICAST_FLT -#define IFF_TEAM_PORT IFF_TEAM_PORT -#define IFF_SUPP_NOFCS IFF_SUPP_NOFCS -#define IFF_LIVE_ADDR_CHANGE IFF_LIVE_ADDR_CHANGE -#define IFF_MACVLANIFF_MACVLAN -#define
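The `__IFF_BIT()` pattern in this patch matters because shifting a plain `int` constant by 31 already lands on the sign bit, and shifting by 32 is undefined behaviour; widening to the 64-bit flag type before shifting makes bits 32..63 usable. A standalone check (macro and type names here are illustrative mirrors of the patch's `__IFF_BIT`/`netdev_priv_flags_t`):

```c
#include <stdint.h>

typedef uint64_t fake_priv_flags_t;

/* Mirror of the patch's __IFF_BIT(): cast to the 64-bit flag type
 * before shifting, so the shift happens in 64-bit arithmetic. */
#define FAKE_IFF_BIT(bit) ((fake_priv_flags_t)1 << (bit))

static fake_priv_flags_t fake_flag(int bit)
{
    return FAKE_IFF_BIT(bit);
}
```

Had the cast been omitted, `1 << 32` would be undefined, which is exactly why the 33rd flag (IFF_NOT_USE_DMA_ADDR in the next patch) forces this 64-bit conversion first.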
Re: [PATCH 7/9] vhost: allow userspace to create workers
On Wed, Jun 09, 2021 at 04:03:55PM -0500, Mike Christie wrote:
> On 6/7/21 10:19 AM, Stefan Hajnoczi wrote:
> > My concern is that threads should probably be accounted against
> > RLIMIT_NPROC and max_threads rather than something indirect like 128 *
> > RLIMIT_NOFILE (a userspace process can only have RLIMIT_NOFILE
> > vhost-user file descriptors open).
>
> Ah ok, I see what you want, I think.
>
> Ok, I think the options are:
>
> 0. Nothing. Just use the existing indirect/RLIMIT_NOFILE limit.
>
> 1. Do something like io_uring's create_io_thread/copy_process. If we call
> copy_process from the vhost ioctl context, then the userspace process that
> did the ioctl will have its process count incremented and checked against
> its rlimit.
>
> The drawbacks:
> - This gets a little more complicated than just calling copy_process though.
> We end up duplicating a lot of the kthread API.
> - We have to deal with new error cases like the parent exiting early.
> - I think all devs sharing a worker have to have the same owner. kthread_use_mm
> and kthread_unuse_mm to switch between mm's for different owners' devs seem to
> be causing lots of errors. I'm still looking into this one though.
>
> 2. It's not really what you want, but for unbound work io_uring has a check for
> RLIMIT_NPROC in the io_uring code. It does:
>
> wqe->acct[IO_WQ_ACCT_UNBOUND].max_workers =
> 			task_rlimit(current, RLIMIT_NPROC);
>
> then does:
>
> if (!ret && acct->nr_workers < acct->max_workers) {
>
> Drawbacks:
> In vhost.c, we could do something similar. It would make sure that vhost.c does
> not create more worker threads than the rlimit value, but we wouldn't be
> incrementing the userspace process's process count. The userspace process could
> then create RLIMIT_NPROC threads, and vhost.c could also create RLIMIT_NPROC
> threads, so we end up with 2 * RLIMIT_NPROC threads.

Yes, in that case we might as well go with Option 0, so I think this
option can be eliminated.

> 3. Change the kthread and copy_process code so we can pass in the thread
> (or its creds, or some struct that has the values that need to be checked)
> that needs to be checked and updated.
>
> Drawback:
> This might be considered too ugly for how special-case vhost is. For example, we
> need checks/code like the io_thread/PF_IO_WORKER code in copy_process for
> io_uring. I can see adding that for io_uring because it affects so many users,
> but I can see how vhost is not special enough.

This seems like the most general solution. If you try it and get
negative feedback then maybe the maintainers can help suggest how to
solve this problem :).

Stefan
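Option 2's io_uring-style cap can be reduced to a plain counter check. The sketch below is illustrative (`fake_acct`, `fake_try_create_worker()` are invented names); in the real code `max_workers` would be seeded from `task_rlimit(current, RLIMIT_NPROC)` and the failure path would return -EAGAIN:

```c
/* Toy version of io_uring's unbound-worker accounting discussed above. */
struct fake_acct {
    unsigned int nr_workers;
    unsigned int max_workers;  /* would come from RLIMIT_NPROC */
};

/* Take a worker slot, or fail when the cap is reached. Returns 0 on
 * success and -1 when no more workers may be created. */
static int fake_try_create_worker(struct fake_acct *acct)
{
    if (acct->nr_workers >= acct->max_workers)
        return -1;
    acct->nr_workers++;
    return 0;
}
```

This illustrates the drawback Mike points out: the cap bounds vhost's own workers, but nothing here charges the userspace process's NPROC accounting, so the two pools of threads add up.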
[PATCH v4 4/6] iommu/dma: Pass address limit rather than size to iommu_setup_dma_ops()
Passing a 64-bit address width to iommu_setup_dma_ops() is valid on virtual platforms, but isn't currently possible. The overflow check in iommu_dma_init_domain() prevents this even when @dma_base isn't 0. Pass a limit address instead of a size, so callers don't have to fake a size to work around the check. Signed-off-by: Jean-Philippe Brucker --- include/linux/dma-iommu.h | 4 ++-- arch/arm64/mm/dma-mapping.c | 2 +- drivers/iommu/amd/iommu.c | 2 +- drivers/iommu/dma-iommu.c | 12 ++-- drivers/iommu/intel/iommu.c | 2 +- 5 files changed, 11 insertions(+), 11 deletions(-) diff --git a/include/linux/dma-iommu.h b/include/linux/dma-iommu.h index 6e75a2d689b4..758ca4694257 100644 --- a/include/linux/dma-iommu.h +++ b/include/linux/dma-iommu.h @@ -19,7 +19,7 @@ int iommu_get_msi_cookie(struct iommu_domain *domain, dma_addr_t base); void iommu_put_dma_cookie(struct iommu_domain *domain); /* Setup call for arch DMA mapping code */ -void iommu_setup_dma_ops(struct device *dev, u64 dma_base, u64 size); +void iommu_setup_dma_ops(struct device *dev, u64 dma_base, u64 dma_limit); /* The DMA API isn't _quite_ the whole story, though... 
*/ /*

@@ -50,7 +50,7 @@ struct msi_msg;
 struct device;
 
 static inline void iommu_setup_dma_ops(struct device *dev, u64 dma_base,
-				       u64 size)
+				       u64 dma_limit)
 {
 }
 
diff --git a/arch/arm64/mm/dma-mapping.c b/arch/arm64/mm/dma-mapping.c
index 4bf1dd3eb041..7bd1d2199141 100644
--- a/arch/arm64/mm/dma-mapping.c
+++ b/arch/arm64/mm/dma-mapping.c
@@ -50,7 +50,7 @@ void arch_setup_dma_ops(struct device *dev, u64 dma_base, u64 size,
 	dev->dma_coherent = coherent;
 	if (iommu)
-		iommu_setup_dma_ops(dev, dma_base, size);
+		iommu_setup_dma_ops(dev, dma_base, dma_base + size - 1);
 
 #ifdef CONFIG_XEN
 	if (xen_swiotlb_detect())
diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
index 3ac42bbdefc6..94b96d81fcfd 100644
--- a/drivers/iommu/amd/iommu.c
+++ b/drivers/iommu/amd/iommu.c
@@ -1713,7 +1713,7 @@ static void amd_iommu_probe_finalize(struct device *dev)
 	/* Domains are initialized for this device - have a look what we ended up with */
 	domain = iommu_get_domain_for_dev(dev);
 	if (domain->type == IOMMU_DOMAIN_DMA)
-		iommu_setup_dma_ops(dev, IOVA_START_PFN << PAGE_SHIFT, 0);
+		iommu_setup_dma_ops(dev, IOVA_START_PFN << PAGE_SHIFT, U64_MAX);
 	else
 		set_dma_ops(dev, NULL);
 }
diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index 7bcdd1205535..c62e19bed302 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -319,16 +319,16 @@ static bool dev_is_untrusted(struct device *dev)
  * iommu_dma_init_domain - Initialise a DMA mapping domain
  * @domain: IOMMU domain previously prepared by iommu_get_dma_cookie()
  * @base: IOVA at which the mappable address space starts
- * @size: Size of IOVA space
+ * @limit: Last address of the IOVA space
  * @dev: Device the domain is being initialised for
  *
- * @base and @size should be exact multiples of IOMMU page granularity to
+ * @base and @limit + 1 should be exact multiples of IOMMU page granularity to
  * avoid rounding surprises.
If necessary, we reserve the page at address 0 * to ensure it is an invalid IOVA. It is safe to reinitialise a domain, but * any change which could make prior IOVAs invalid will fail. */ static int iommu_dma_init_domain(struct iommu_domain *domain, dma_addr_t base, - u64 size, struct device *dev) +dma_addr_t limit, struct device *dev) { struct iommu_dma_cookie *cookie = domain->iova_cookie; unsigned long order, base_pfn; @@ -346,7 +346,7 @@ static int iommu_dma_init_domain(struct iommu_domain *domain, dma_addr_t base, /* Check the domain allows at least some access to the device... */ if (domain->geometry.force_aperture) { if (base > domain->geometry.aperture_end || - base + size <= domain->geometry.aperture_start) { + limit < domain->geometry.aperture_start) { pr_warn("specified DMA range outside IOMMU capability\n"); return -EFAULT; } @@ -1308,7 +1308,7 @@ static const struct dma_map_ops iommu_dma_ops = { * The IOMMU core code allocates the default DMA domain, which the underlying * IOMMU driver needs to support via the dma-iommu layer. */ -void iommu_setup_dma_ops(struct device *dev, u64 dma_base, u64 size) +void iommu_setup_dma_ops(struct device *dev, u64 dma_base, u64 dma_limit) { struct iommu_domain *domain = iommu_get_domain_for_dev(dev); @@ -1320,7 +1320,7 @@ void iommu_setup_dma_ops(struct device *dev, u64 dma_base, u64 size) * underlying IOMMU driver needs to support via the dma-iommu layer. */ if (domain->type == IOM
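The overflow this patch avoids is easy to demonstrate in isolation. The two helpers below are illustrative reductions of the old size-based and new limit-based aperture checks, not the kernel functions themselves: with a full 64-bit address width, `base + size` wraps to 0 and the size form wrongly rejects a valid window, while comparing against a last-address limit involves no arithmetic that can overflow.

```c
#include <stdint.h>
#include <stdbool.h>

/* Old style: pass a size. For a full 64-bit window, base + size wraps
 * to 0, so "base + size <= aperture_start" spuriously triggers. */
static bool aperture_ok_size(uint64_t base, uint64_t size,
                             uint64_t ap_start, uint64_t ap_end)
{
    return !(base > ap_end || base + size <= ap_start);
}

/* New style: pass the last valid address; nothing can overflow. */
static bool aperture_ok_limit(uint64_t base, uint64_t limit,
                              uint64_t ap_start, uint64_t ap_end)
{
    return !(base > ap_end || limit < ap_start);
}
```

This is exactly the check in iommu_dma_init_domain() that previously forced callers to fake a size; passing `limit` lets a 64-bit address width through while still rejecting ranges outside the aperture.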
[PATCH v4 3/6] ACPI: Add driver for the VIOT table
The ACPI Virtual I/O Translation Table describes topology of para-virtual platforms, similarly to vendor tables DMAR, IVRS and IORT. For now it describes the relation between virtio-iommu and the endpoints it manages. Three steps are needed to configure DMA of endpoints: (1) acpi_viot_init(): parse the VIOT table, find or create the fwnode associated to each vIOMMU device. (2) When probing the vIOMMU device, the driver registers its IOMMU ops within the IOMMU subsystem. This step doesn't require any intervention from the VIOT driver. (3) viot_iommu_configure(): before binding the endpoint to a driver, find the associated IOMMU ops. Register them, along with the endpoint ID, into the device's iommu_fwspec. If step (3) happens before step (2), it is deferred until the IOMMU is initialized, then retried. Signed-off-by: Jean-Philippe Brucker --- drivers/acpi/Kconfig | 3 + drivers/iommu/Kconfig | 1 + drivers/acpi/Makefile | 2 + include/linux/acpi_viot.h | 19 ++ drivers/acpi/bus.c| 2 + drivers/acpi/scan.c | 3 + drivers/acpi/viot.c | 364 ++ MAINTAINERS | 8 + 8 files changed, 402 insertions(+) create mode 100644 include/linux/acpi_viot.h create mode 100644 drivers/acpi/viot.c diff --git a/drivers/acpi/Kconfig b/drivers/acpi/Kconfig index eedec61e3476..3758c6940ed7 100644 --- a/drivers/acpi/Kconfig +++ b/drivers/acpi/Kconfig @@ -526,6 +526,9 @@ endif source "drivers/acpi/pmic/Kconfig" +config ACPI_VIOT + bool + endif # ACPI config X86_PM_TIMER diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig index 1f111b399bca..aff8a4830dd1 100644 --- a/drivers/iommu/Kconfig +++ b/drivers/iommu/Kconfig @@ -403,6 +403,7 @@ config VIRTIO_IOMMU depends on ARM64 select IOMMU_API select INTERVAL_TREE + select ACPI_VIOT if ACPI help Para-virtualised IOMMU driver with virtio. 
diff --git a/drivers/acpi/Makefile b/drivers/acpi/Makefile index 700b41adf2db..a6e644c48987 100644 --- a/drivers/acpi/Makefile +++ b/drivers/acpi/Makefile @@ -118,3 +118,5 @@ video-objs += acpi_video.o video_detect.o obj-y += dptf/ obj-$(CONFIG_ARM64)+= arm64/ + +obj-$(CONFIG_ACPI_VIOT)+= viot.o diff --git a/include/linux/acpi_viot.h b/include/linux/acpi_viot.h new file mode 100644 index ..1eb8ee5b0e5f --- /dev/null +++ b/include/linux/acpi_viot.h @@ -0,0 +1,19 @@ +/* SPDX-License-Identifier: GPL-2.0-only */ + +#ifndef __ACPI_VIOT_H__ +#define __ACPI_VIOT_H__ + +#include + +#ifdef CONFIG_ACPI_VIOT +void __init acpi_viot_init(void); +int viot_iommu_configure(struct device *dev); +#else +static inline void acpi_viot_init(void) {} +static inline int viot_iommu_configure(struct device *dev) +{ + return -ENODEV; +} +#endif + +#endif /* __ACPI_VIOT_H__ */ diff --git a/drivers/acpi/bus.c b/drivers/acpi/bus.c index be7da23fad76..b835ca702ff0 100644 --- a/drivers/acpi/bus.c +++ b/drivers/acpi/bus.c @@ -27,6 +27,7 @@ #include #endif #include +#include #include #include #include @@ -1339,6 +1340,7 @@ static int __init acpi_init(void) pci_mmcfg_late_init(); acpi_iort_init(); acpi_scan_init(); + acpi_viot_init(); acpi_ec_init(); acpi_debugfs_init(); acpi_sleep_proc_init(); diff --git a/drivers/acpi/scan.c b/drivers/acpi/scan.c index 0c53c8533300..4fa684fdfda8 100644 --- a/drivers/acpi/scan.c +++ b/drivers/acpi/scan.c @@ -9,6 +9,7 @@ #include #include #include +#include #include #include #include @@ -1556,6 +1557,8 @@ static const struct iommu_ops *acpi_iommu_configure_id(struct device *dev, return ops; err = iort_iommu_configure_id(dev, id_in); + if (err && err != -EPROBE_DEFER) + err = viot_iommu_configure(dev); /* * If we have reason to believe the IOMMU driver missed the initial diff --git a/drivers/acpi/viot.c b/drivers/acpi/viot.c new file mode 100644 index ..892cd9fa7b6d --- /dev/null +++ b/drivers/acpi/viot.c @@ -0,0 +1,364 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * 
Virtual I/O topology + * + * The Virtual I/O Translation Table (VIOT) describes the topology of + * para-virtual IOMMUs and the endpoints they manage. The OS uses it to + * initialize devices in the right order, preventing endpoints from issuing DMA + * before their IOMMU is ready. + * + * When binding a driver to a device, before calling the device driver's probe() + * method, the driver infrastructure calls dma_configure(). At that point the + * VIOT driver looks for an IOMMU associated to the device in the VIOT table. + * If an IOMMU exists and has been initialized, the VIOT driver initializes the + * device's IOMMU fwspec, allowing the DMA infrastructure to invoke the IOMMU + * ops when the device driver configures DMA mappings. If an IOMMU exists a
[PATCH v4 5/6] iommu/dma: Simplify calls to iommu_setup_dma_ops()
dma-iommu uses the address bounds described in domain->geometry during IOVA allocation. The address size parameters of iommu_setup_dma_ops() are useful for describing additional limits set by the platform firmware, but aren't needed for drivers that call this function from probe_finalize(). The base parameter can be zero because dma-iommu already removes the first IOVA page, and the limit parameter can be U64_MAX because it's only checked against the domain geometry. Simplify calls to iommu_setup_dma_ops(). Signed-off-by: Jean-Philippe Brucker --- drivers/iommu/amd/iommu.c | 9 + drivers/iommu/dma-iommu.c | 4 +++- drivers/iommu/intel/iommu.c | 10 +- 3 files changed, 5 insertions(+), 18 deletions(-) diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c index 94b96d81fcfd..d3123bc05c08 100644 --- a/drivers/iommu/amd/iommu.c +++ b/drivers/iommu/amd/iommu.c @@ -1708,14 +1708,7 @@ static struct iommu_device *amd_iommu_probe_device(struct device *dev) static void amd_iommu_probe_finalize(struct device *dev) { - struct iommu_domain *domain; - - /* Domains are initialized for this device - have a look what we ended up with */ - domain = iommu_get_domain_for_dev(dev); - if (domain->type == IOMMU_DOMAIN_DMA) - iommu_setup_dma_ops(dev, IOVA_START_PFN << PAGE_SHIFT, U64_MAX); - else - set_dma_ops(dev, NULL); + iommu_setup_dma_ops(dev, 0, U64_MAX); } static void amd_iommu_release_device(struct device *dev) diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c index c62e19bed302..175f8eaeb5b3 100644 --- a/drivers/iommu/dma-iommu.c +++ b/drivers/iommu/dma-iommu.c @@ -1322,7 +1322,9 @@ void iommu_setup_dma_ops(struct device *dev, u64 dma_base, u64 dma_limit) if (domain->type == IOMMU_DOMAIN_DMA) { if (iommu_dma_init_domain(domain, dma_base, dma_limit, dev)) goto out_err; - dev->dma_ops = &iommu_dma_ops; + set_dma_ops(dev, &iommu_dma_ops); + } else { + set_dma_ops(dev, NULL); } return; diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c 
index 85f18342603c..8d866940692a 100644 --- a/drivers/iommu/intel/iommu.c +++ b/drivers/iommu/intel/iommu.c @@ -5165,15 +5165,7 @@ static void intel_iommu_release_device(struct device *dev) static void intel_iommu_probe_finalize(struct device *dev) { - dma_addr_t base = IOVA_START_PFN << VTD_PAGE_SHIFT; - struct iommu_domain *domain = iommu_get_domain_for_dev(dev); - struct dmar_domain *dmar_domain = to_dmar_domain(domain); - - if (domain && domain->type == IOMMU_DOMAIN_DMA) - iommu_setup_dma_ops(dev, base, - __DOMAIN_MAX_ADDR(dmar_domain->gaw)); - else - set_dma_ops(dev, NULL); + iommu_setup_dma_ops(dev, 0, U64_MAX); } static void intel_iommu_get_resv_regions(struct device *device, -- 2.31.1 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH v4 2/6] ACPI: Move IOMMU setup code out of IORT
Extract the code that sets up the IOMMU infrastructure from IORT, since it can be reused by VIOT. Move it one level up into a new acpi_iommu_configure_id() function, which calls the IORT parsing function, which in turn calls the acpi_iommu_fwspec_init() helper.

Signed-off-by: Jean-Philippe Brucker
---
 include/acpi/acpi_bus.h   |  3 ++
 include/linux/acpi_iort.h |  8 ++---
 drivers/acpi/arm64/iort.c | 75 +--
 drivers/acpi/scan.c       | 73 -
 4 files changed, 87 insertions(+), 72 deletions(-)

diff --git a/include/acpi/acpi_bus.h b/include/acpi/acpi_bus.h
index 3a82faac5767..41f092a269f6 100644
--- a/include/acpi/acpi_bus.h
+++ b/include/acpi/acpi_bus.h
@@ -588,6 +588,9 @@ struct acpi_pci_root {
 bool acpi_dma_supported(struct acpi_device *adev);
 enum dev_dma_attr acpi_get_dma_attr(struct acpi_device *adev);
+int acpi_iommu_fwspec_init(struct device *dev, u32 id,
+			   struct fwnode_handle *fwnode,
+			   const struct iommu_ops *ops);
 int acpi_dma_get_range(struct device *dev, u64 *dma_addr, u64 *offset,
 		       u64 *size);
 int acpi_dma_configure_id(struct device *dev, enum dev_dma_attr attr,
diff --git a/include/linux/acpi_iort.h b/include/linux/acpi_iort.h
index f7f054833afd..f1f0842a2cb2 100644
--- a/include/linux/acpi_iort.h
+++ b/include/linux/acpi_iort.h
@@ -35,8 +35,7 @@ void acpi_configure_pmsi_domain(struct device *dev);
 int iort_pmsi_get_dev_id(struct device *dev, u32 *dev_id);
 /* IOMMU interface */
 int iort_dma_get_ranges(struct device *dev, u64 *size);
-const struct iommu_ops *iort_iommu_configure_id(struct device *dev,
-						const u32 *id_in);
+int iort_iommu_configure_id(struct device *dev, const u32 *id_in);
 int iort_iommu_msi_get_resv_regions(struct device *dev, struct list_head *head);
 phys_addr_t acpi_iort_dma_get_max_cpu_address(void);
 #else
@@ -50,9 +49,8 @@ static inline void acpi_configure_pmsi_domain(struct device *dev) { }
 /* IOMMU interface */
 static inline int iort_dma_get_ranges(struct device *dev, u64 *size)
 { return -ENODEV; }
-static inline const struct iommu_ops *iort_iommu_configure_id(
-			struct device *dev, const u32 *id_in)
-{ return NULL; }
+static inline int iort_iommu_configure_id(struct device *dev, const u32 *id_in)
+{ return -ENODEV; }
 static inline int iort_iommu_msi_get_resv_regions(struct device *dev,
						  struct list_head *head)
 { return 0; }
diff --git a/drivers/acpi/arm64/iort.c b/drivers/acpi/arm64/iort.c
index a940be1cf2af..b5b021e064b6 100644
--- a/drivers/acpi/arm64/iort.c
+++ b/drivers/acpi/arm64/iort.c
@@ -806,23 +806,6 @@ static struct acpi_iort_node *iort_get_msi_resv_iommu(struct device *dev)
 	return NULL;
 }
-static inline const struct iommu_ops *iort_fwspec_iommu_ops(struct device *dev)
-{
-	struct iommu_fwspec *fwspec = dev_iommu_fwspec_get(dev);
-
-	return (fwspec && fwspec->ops) ? fwspec->ops : NULL;
-}
-
-static inline int iort_add_device_replay(struct device *dev)
-{
-	int err = 0;
-
-	if (dev->bus && !device_iommu_mapped(dev))
-		err = iommu_probe_device(dev);
-
-	return err;
-}
-
 /**
  * iort_iommu_msi_get_resv_regions - Reserved region driver helper
  * @dev: Device from iommu_get_resv_regions()
@@ -900,18 +883,6 @@ static inline bool iort_iommu_driver_enabled(u8 type)
 	}
 }
-static int arm_smmu_iort_xlate(struct device *dev, u32 streamid,
-			       struct fwnode_handle *fwnode,
-			       const struct iommu_ops *ops)
-{
-	int ret = iommu_fwspec_init(dev, fwnode, ops);
-
-	if (!ret)
-		ret = iommu_fwspec_add_ids(dev, &streamid, 1);
-
-	return ret;
-}
-
 static bool iort_pci_rc_supports_ats(struct acpi_iort_node *node)
 {
 	struct acpi_iort_root_complex *pci_rc;
@@ -946,7 +917,7 @@ static int iort_iommu_xlate(struct device *dev, struct acpi_iort_node *node,
 		return iort_iommu_driver_enabled(node->type) ?
		       -EPROBE_DEFER : -ENODEV;
-	return arm_smmu_iort_xlate(dev, streamid, iort_fwnode, ops);
+	return acpi_iommu_fwspec_init(dev, streamid, iort_fwnode, ops);
 }

 struct iort_pci_alias_info {
@@ -1020,24 +991,14 @@ static int iort_nc_iommu_map_id(struct device *dev,
  * @dev: device to configure
  * @id_in: optional input id const value pointer
  *
- * Returns: iommu_ops pointer on configuration success
- *          NULL on configuration failure
+ * Returns: 0 on success, <0 on failure
  */
-const struct iommu_ops *iort_iommu_configure_id(struct device *dev,
-						const u32 *id_in)
+int iort_iommu_configure_id(struct device *dev, const u32 *id_in)
 {
 	struct acpi_iort_node *node;
-	const struct iommu_ops *ops;
+	const struct iommu_ops *ops = NULL;
 	int err = -ENODEV;
-
[PATCH v4 6/6] iommu/virtio: Enable x86 support
With the VIOT support in place, x86 platforms can now use the virtio-iommu. Because the other x86 IOMMU drivers aren't yet ready to use the acpi_dma_setup() path, x86 doesn't implement arch_setup_dma_ops() at the moment. Similarly to Vt-d and AMD IOMMU, call iommu_setup_dma_ops() from probe_finalize().

Acked-by: Joerg Roedel
Acked-by: Michael S. Tsirkin
Signed-off-by: Jean-Philippe Brucker
---
 drivers/iommu/Kconfig        | 3 ++-
 drivers/iommu/dma-iommu.c    | 1 +
 drivers/iommu/virtio-iommu.c | 8
 3 files changed, 11 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig
index aff8a4830dd1..07b7c25cbed8 100644
--- a/drivers/iommu/Kconfig
+++ b/drivers/iommu/Kconfig
@@ -400,8 +400,9 @@ config HYPERV_IOMMU
 config VIRTIO_IOMMU
 	tristate "Virtio IOMMU driver"
 	depends on VIRTIO
-	depends on ARM64
+	depends on (ARM64 || X86)
 	select IOMMU_API
+	select IOMMU_DMA
 	select INTERVAL_TREE
 	select ACPI_VIOT if ACPI
 	help
diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index 175f8eaeb5b3..46ed43c400cf 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -1332,6 +1332,7 @@ void iommu_setup_dma_ops(struct device *dev, u64 dma_base, u64 dma_limit)
 		pr_warn("Failed to set up IOMMU for device %s; retaining platform DMA ops\n",
 			dev_name(dev));
 }
+EXPORT_SYMBOL_GPL(iommu_setup_dma_ops);

 static struct iommu_dma_msi_page *iommu_dma_get_msi_page(struct device *dev,
		phys_addr_t msi_addr, struct iommu_domain *domain)
diff --git a/drivers/iommu/virtio-iommu.c b/drivers/iommu/virtio-iommu.c
index 218fe8560e8d..77aee1207ced 100644
--- a/drivers/iommu/virtio-iommu.c
+++ b/drivers/iommu/virtio-iommu.c
@@ -1026,6 +1026,13 @@ static struct iommu_device *viommu_probe_device(struct device *dev)
 	return ERR_PTR(ret);
 }

+static void viommu_probe_finalize(struct device *dev)
+{
+#ifndef CONFIG_ARCH_HAS_SETUP_DMA_OPS
+	iommu_setup_dma_ops(dev, 0, U64_MAX);
+#endif
+}
+
 static void viommu_release_device(struct device *dev)
 {
 	struct iommu_fwspec *fwspec = dev_iommu_fwspec_get(dev);
@@ -1062,6 +1069,7 @@ static struct iommu_ops viommu_ops = {
 	.iova_to_phys		= viommu_iova_to_phys,
 	.iotlb_sync		= viommu_iotlb_sync,
 	.probe_device		= viommu_probe_device,
+	.probe_finalize		= viommu_probe_finalize,
 	.release_device		= viommu_release_device,
 	.device_group		= viommu_device_group,
 	.get_resv_regions	= viommu_get_resv_regions,
--
2.31.1
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH v4 1/6] ACPI: arm64: Move DMA setup operations out of IORT
Extract generic DMA setup code out of IORT, so it can be reused by VIOT. Keep it in drivers/acpi/arm64 for now, since it could break x86 platforms that haven't run this code so far, if they have invalid tables.

Signed-off-by: Jean-Philippe Brucker
---
 drivers/acpi/arm64/Makefile |  1 +
 include/linux/acpi.h        |  3 +++
 include/linux/acpi_iort.h   |  6 ++---
 drivers/acpi/arm64/dma.c    | 50 ++
 drivers/acpi/arm64/iort.c   | 54 ++---
 drivers/acpi/scan.c         |  2 +-
 6 files changed, 66 insertions(+), 50 deletions(-)
 create mode 100644 drivers/acpi/arm64/dma.c

diff --git a/drivers/acpi/arm64/Makefile b/drivers/acpi/arm64/Makefile
index 6ff50f4ed947..66acbe77f46e 100644
--- a/drivers/acpi/arm64/Makefile
+++ b/drivers/acpi/arm64/Makefile
@@ -1,3 +1,4 @@
 # SPDX-License-Identifier: GPL-2.0-only
 obj-$(CONFIG_ACPI_IORT)	+= iort.o
 obj-$(CONFIG_ACPI_GTDT)	+= gtdt.o
+obj-y			+= dma.o
diff --git a/include/linux/acpi.h b/include/linux/acpi.h
index c60745f657e9..7aaa9559cc19 100644
--- a/include/linux/acpi.h
+++ b/include/linux/acpi.h
@@ -259,9 +259,12 @@ void acpi_numa_x2apic_affinity_init(struct acpi_srat_x2apic_cpu_affinity *pa);

 #ifdef CONFIG_ARM64
 void acpi_numa_gicc_affinity_init(struct acpi_srat_gicc_affinity *pa);
+void acpi_arch_dma_setup(struct device *dev, u64 *dma_addr, u64 *dma_size);
 #else
 static inline void
 acpi_numa_gicc_affinity_init(struct acpi_srat_gicc_affinity *pa) { }
+static inline void
+acpi_arch_dma_setup(struct device *dev, u64 *dma_addr, u64 *dma_size) { }
 #endif

 int acpi_numa_memory_affinity_init (struct acpi_srat_mem_affinity *ma);
diff --git a/include/linux/acpi_iort.h b/include/linux/acpi_iort.h
index 1a12baa58e40..f7f054833afd 100644
--- a/include/linux/acpi_iort.h
+++ b/include/linux/acpi_iort.h
@@ -34,7 +34,7 @@ struct irq_domain *iort_get_device_domain(struct device *dev, u32 id,
 void acpi_configure_pmsi_domain(struct device *dev);
 int iort_pmsi_get_dev_id(struct device *dev, u32 *dev_id);
 /* IOMMU interface */
-void iort_dma_setup(struct device *dev, u64 *dma_addr, u64 *size);
+int iort_dma_get_ranges(struct device *dev, u64 *size);
 const struct iommu_ops *iort_iommu_configure_id(struct device *dev,
						const u32 *id_in);
 int iort_iommu_msi_get_resv_regions(struct device *dev, struct list_head *head);
@@ -48,8 +48,8 @@ static inline struct irq_domain *iort_get_device_domain(
 { return NULL; }
 static inline void acpi_configure_pmsi_domain(struct device *dev) { }
 /* IOMMU interface */
-static inline void iort_dma_setup(struct device *dev, u64 *dma_addr,
-				  u64 *size) { }
+static inline int iort_dma_get_ranges(struct device *dev, u64 *size)
+{ return -ENODEV; }
 static inline const struct iommu_ops *iort_iommu_configure_id(
			struct device *dev, const u32 *id_in)
 { return NULL; }
diff --git a/drivers/acpi/arm64/dma.c b/drivers/acpi/arm64/dma.c
new file mode 100644
index ..f16739ad3cc0
--- /dev/null
+++ b/drivers/acpi/arm64/dma.c
@@ -0,0 +1,50 @@
+// SPDX-License-Identifier: GPL-2.0-only
+#include
+#include
+#include
+#include
+
+void acpi_arch_dma_setup(struct device *dev, u64 *dma_addr, u64 *dma_size)
+{
+	int ret;
+	u64 end, mask;
+	u64 dmaaddr = 0, size = 0, offset = 0;
+
+	/*
+	 * If @dev is expected to be DMA-capable then the bus code that created
+	 * it should have initialised its dma_mask pointer by this point. For
+	 * now, we'll continue the legacy behaviour of coercing it to the
+	 * coherent mask if not, but we'll no longer do so quietly.
+	 */
+	if (!dev->dma_mask) {
+		dev_warn(dev, "DMA mask not set\n");
+		dev->dma_mask = &dev->coherent_dma_mask;
+	}
+
+	if (dev->coherent_dma_mask)
+		size = max(dev->coherent_dma_mask, dev->coherent_dma_mask + 1);
+	else
+		size = 1ULL << 32;
+
+	ret = acpi_dma_get_range(dev, &dmaaddr, &offset, &size);
+	if (ret == -ENODEV)
+		ret = iort_dma_get_ranges(dev, &size);
+	if (!ret) {
+		/*
+		 * Limit coherent and dma mask based on size retrieved from
+		 * firmware.
+		 */
+		end = dmaaddr + size - 1;
+		mask = DMA_BIT_MASK(ilog2(end) + 1);
+		dev->bus_dma_limit = end;
+		dev->coherent_dma_mask = min(dev->coherent_dma_mask, mask);
+		*dev->dma_mask = min(*dev->dma_mask, mask);
+	}
+
+	*dma_addr = dmaaddr;
+	*dma_size = size;
+
+	ret = dma_direct_set_offset(dev, dmaaddr + offset, dmaaddr, size);
+
+	dev_dbg(dev, "dma_offset(%#08llx)%s\n", offset, ret ? " failed!" : "");
+}
diff --git a/drivers/acpi/arm64/iort.c b/drivers/acpi/arm64/iort.c
index 3912a1f6058e..a940be1cf2af 100644
[PATCH v4 0/6] Add support for ACPI VIOT
Add a driver for the ACPI VIOT table, which provides topology information for para-virtual IOMMUs. Enable virtio-iommu on non-devicetree platforms, including x86.

Since v3 [1] I fixed a build bug for !CONFIG_IOMMU_API. Joerg offered to take this series through the IOMMU tree, which requires Acks for patches 1-3.

You can find a QEMU implementation at [2], with extra support for testing all VIOT nodes including MMIO-based endpoints and IOMMU. This series is at [3].

[1] https://lore.kernel.org/linux-iommu/2021060215.1077006-1-jean-phili...@linaro.org/
[2] https://jpbrucker.net/git/qemu/log/?h=virtio-iommu/acpi
[3] https://jpbrucker.net/git/linux/log/?h=virtio-iommu/acpi

Jean-Philippe Brucker (6):
  ACPI: arm64: Move DMA setup operations out of IORT
  ACPI: Move IOMMU setup code out of IORT
  ACPI: Add driver for the VIOT table
  iommu/dma: Pass address limit rather than size to iommu_setup_dma_ops()
  iommu/dma: Simplify calls to iommu_setup_dma_ops()
  iommu/virtio: Enable x86 support

 drivers/acpi/Kconfig         |   3 +
 drivers/iommu/Kconfig        |   4 +-
 drivers/acpi/Makefile        |   2 +
 drivers/acpi/arm64/Makefile  |   1 +
 include/acpi/acpi_bus.h      |   3 +
 include/linux/acpi.h         |   3 +
 include/linux/acpi_iort.h    |  14 +-
 include/linux/acpi_viot.h    |  19 ++
 include/linux/dma-iommu.h    |   4 +-
 arch/arm64/mm/dma-mapping.c  |   2 +-
 drivers/acpi/arm64/dma.c     |  50 +
 drivers/acpi/arm64/iort.c    | 129 ++---
 drivers/acpi/bus.c           |   2 +
 drivers/acpi/scan.c          |  78 +++-
 drivers/acpi/viot.c          | 364 +++
 drivers/iommu/amd/iommu.c    |   9 +-
 drivers/iommu/dma-iommu.c    |  17 +-
 drivers/iommu/intel/iommu.c  |  10 +-
 drivers/iommu/virtio-iommu.c |   8 +
 MAINTAINERS                  |   8 +
 20 files changed, 580 insertions(+), 150 deletions(-)
 create mode 100644 include/linux/acpi_viot.h
 create mode 100644 drivers/acpi/arm64/dma.c
 create mode 100644 drivers/acpi/viot.c

--
2.31.1
Re: [RFC v1 0/6] virtio/vsock: introduce SOCK_DGRAM support
On 2021/6/10 3:23 PM, Stefano Garzarella wrote:

On Thu, Jun 10, 2021 at 12:02:35PM +0800, Jason Wang wrote:

On 2021/6/10 11:43 AM, Jiang Wang wrote:

On Wed, Jun 9, 2021 at 6:51 PM Jason Wang wrote:

On 2021/6/10 7:24 AM, Jiang Wang wrote:

This patchset implements support of SOCK_DGRAM for virtio transport. Datagram sockets are connectionless and unreliable. To avoid unfair contention with stream and other sockets, add two more virtqueues and a new feature bit to indicate whether those two new queues exist. Dgram does not use the existing credit update mechanism for stream sockets. When sending from the guest/driver, packets are sent synchronously, so the sender gets an error when the virtqueue is full. When sending from the host/device, packets are sent asynchronously, because the descriptor memory belongs to the corresponding QEMU process.

What's the use case for the datagram vsock?

One use case is non-critical info logging from the guest to the host, such as performance data of some applications.

Anything that prevents you from using the stream socket?

It can also be used to replace UDP communications between the guest and the host.

Any advantage for VSOCK in this case? Is it for performance? (I guess not, since I don't expect vsock to be faster.)

I think the general advantage of using vsock is for guest agents that potentially don't need any configuration.

Right, I wonder if we really need datagram, considering host-to-guest communication is reliable. (Note that I don't object to it, since vsock already supports that; I just wonder about its use cases.) An obvious drawback is that it breaks migration. Using UDP you can have very rich feature support from the kernel, which vsock can't.

Thanks for bringing this up! What features does UDP support that datagram on vsock could not?

E.g. sendpage() and busy polling. And using UDP means qdiscs and eBPF can work.
The virtio spec patch is here: https://www.spinics.net/lists/linux-virtualization/msg50027.html

Having a quick glance, I suggest splitting mergeable rx buffer into a separate patch.

Sure.

But I think it's time to revisit the idea of unifying virtio-net and virtio-vsock. Otherwise we're duplicating features and bugs.

For mergeable rxbuf related code, I think a set of common helper functions can be used by both virtio-net and virtio-vsock. For other parts, that may not be very beneficial. I will think about it more. If there is a previous email discussion about this topic, could you send me some links? I did a quick web search but did not find any related info. Thanks.

We had a lot:

[1] https://patchwork.kernel.org/project/kvm/patch/5bdff537.3050...@huawei.com/
[2] https://lists.linuxfoundation.org/pipermail/virtualization/2018-November/039798.html
[3] https://www.lkml.org/lkml/2020/1/16/2043

When I tried it, the biggest problem that blocked me was all the features strictly tied to the TCP/IP stack and Ethernet devices that the vsock device doesn't know how to handle: TSO, GSO, checksums, MAC, NAPI, XDP, minimum Ethernet frame size, MTU, etc.

It depends on which level we want to share:

1) sharing code
2) sharing devices
3) making vsock a protocol that is understood by the network core

We can start from 1); the low-level tx/rx logic can be shared by both virtio-net and vhost-net. For 2) we probably need some work on the spec, probably with a new feature bit to indicate that it's a vsock device, not an Ethernet device. Then, if it is probed as a vsock device, we won't let packets be delivered into the TCP/IP stack. For 3), it would be even harder, and I'm not sure it's worth doing.

So in my opinion unifying them is not so simple, because vsock is not really an Ethernet device, but simply a socket.

We can start from sharing code.
But I fully agree that we shouldn't duplicate functionality and code, so maybe we could find those common parts and create helpers to be used by both.

Yes.

Thanks

Thanks,
Stefano