Re: [PATCH net-next 0/5] virtio-net: support dynamic coalescing moderation

2023-10-24 Thread Michael S. Tsirkin
On Wed, Oct 25, 2023 at 09:18:27AM +0800, Jason Wang wrote:
> On Tue, Oct 24, 2023 at 8:03 PM Heng Qi  wrote:
> >
> >
> >
> > On 2023/10/12 at 4:29 PM, Jason Wang wrote:
> > > On Thu, Oct 12, 2023 at 3:44 PM Heng Qi  wrote:
> > >> Now, virtio-net already supports per-queue moderation parameter
> > >> setting. Based on this, we use the netdim library of linux to support
> > >> dynamic coalescing moderation for virtio-net.
> > >>
> > >> Due to hardware scheduling issues, we only tested rx dim.
> > > Do you have PPS numbers? And TX numbers are also important as the
> > > throughput could be misleading due to various reasons.
> >
> > Hi Jason!
> >
> > The comparison of rx netdim performance is as follows:
> > (the backend supporting tx dim is not yet ready)
> 
> Thanks a lot for the numbers.
> 
> I'd still expect the TX results, as I did play with tx interrupt coalescing
> about 10 years ago.
> 
> I will start to review the series, but let's try to have some TX numbers as
> well.
> 
> Btw, it would be more convenient to have a raw PPS benchmark. E.g. you
> can try to use a software or hardware packet generator.
> 
> Thanks

Latency results are also kind of interesting.
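
A latency data point could reuse the same pinned setup as the numbers below, just in sockperf's ping-pong mode; a sketch, not from the original tests, and the exact flags depend on the installed sockperf version:

  client: taskset -c 0 sockperf pp -p 8989 -i $IP -t 30 -m 16B
  server: taskset -c 0 sockperf sr -p 8989

The average and tail latencies it prints would show how much extra delay the dim profiles introduce compared to static coalescing.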


> >
> >
> > I. Sockperf UDP
> > =
> > 1. Env
> > rxq_0 has affinity to cpu_0
> >
> > 2. Cmd
> > client:  taskset -c 0 sockperf tp -p 8989 -i $IP -t 10 -m 16B
> > server: taskset -c 0 sockperf sr -p 8989
> >
> > 3. Result
> > dim off: 1143277.00 rxpps, throughput 17.844 MBps, cpu is 100%.
> > dim on: 1124161.00 rxpps, throughput 17.610 MBps, cpu is 83.5%.
> > =
> >
> >
> > II. Redis
> > =
> > 1. Env
> > There are 8 rxqs and rxq_i has affinity to cpu_i.
> >
> > 2. Result
> > When all cpus are 100%, ops/sec of memtier_benchmark client is
> > dim off:   978437.23
> > dim on: 1143638.28
> > =
> >
> >
> > III. Nginx
> > =
> > 1. Env
> > There are 8 rxqs and rxq_i has affinity to cpu_i.
> >
> > 2. Result
> > When all cpus are 100%, requests/sec of wrk client is
> > dim off:   877931.67
> > dim on: 1019160.31
> > =
> >
> > Thanks!
> >
> > >
> > > Thanks
> > >
> > >> @Test env
> > >> rxq0 has affinity to cpu0.
> > >>
> > >> @Test cmd
> > >> client: taskset -c 0 sockperf tp -i ${IP} -t 30 --tcp -m ${msg_size}
> > >> server: taskset -c 0 sockperf sr --tcp
> > >>
> > >> @Test res
> > >> The second column is the ratio of the result returned by client
> > >> when rx dim is enabled to the result returned by client when
> > >> rx dim is disabled.
> > >>  --------------------------------------
> > >>  | msg_size |  rx_dim=on / rx_dim=off |
> > >>  --------------------------------------
> > >>  |   14B    |  + 3%                   |
> > >>  --------------------------------------
> > >>  |   100B   |  + 16%                  |
> > >>  --------------------------------------
> > >>  |   500B   |  + 25%                  |
> > >>  --------------------------------------
> > >>  |   1400B  |  + 28%                  |
> > >>  --------------------------------------
> > >>  |   2048B  |  + 22%                  |
> > >>  --------------------------------------
> > >>  |   4096B  |  + 5%                   |
> > >>  --------------------------------------
> > >>
> > >> ---
> > >> This patch set was part of the previous netdim patch set[1].
> > >> [1] was split into a merged bugfix set[2] and the current set.
> > >> The previous relevant commentators have been Cced.
> > >>
> > >> [1] 
> > >> https://lore.kernel.org/all/20230811065512.22190-1-hen...@linux.alibaba.com/
> > >> [2] 
> > >> https://lore.kernel.org/all/cover.1696745452.git.hen...@linux.alibaba.com/
> > >>
> > >> Heng Qi (5):
> > >>virtio-net: returns whether napi is complete
> > >>virtio-net: separate rx/tx coalescing moderation cmds
> > >>virtio-net: extract virtqueue coalescig cmd for reuse
> > >>virtio-net: support rx netdim
> > >>virtio-net: support tx netdim
> > >>
> > >>   drivers/net/virtio_net.c | 394 ---
> > >>   1 file changed, 322 insertions(+), 72 deletions(-)
> > >>
> > >> --
> > >> 2.19.1.6.gb485710b
> > >>
> > >>
> >
> >


Re: [PATCH net-next 5/5] virtio-net: support tx netdim

2023-10-24 Thread Michael S. Tsirkin
On Wed, Oct 25, 2023 at 11:35:43AM +0800, Jason Wang wrote:
> On Thu, Oct 12, 2023 at 3:44 PM Heng Qi  wrote:
> >
> > Similar to rx netdim, this patch supports adaptive tx
> > coalescing moderation for the virtio-net.
> >
> > Signed-off-by: Heng Qi 
> > ---
> >  drivers/net/virtio_net.c | 143 ---
> >  1 file changed, 119 insertions(+), 24 deletions(-)
> >
> > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> > index 6ad2890a7909..1c680cb09d48 100644
> > --- a/drivers/net/virtio_net.c
> > +++ b/drivers/net/virtio_net.c
> > @@ -154,6 +154,15 @@ struct send_queue {
> >
> > struct virtnet_sq_stats stats;
> >
> > +   /* The number of tx notifications */
> > +   u16 calls;
> > +
> > +   /* Is dynamic interrupt moderation enabled? */
> > +   bool dim_enabled;
> > +
> > +   /* Dynamic Interrupt Moderation */
> > +   struct dim dim;
> > +
> > struct virtnet_interrupt_coalesce intr_coal;
> >
> > struct napi_struct napi;
> > @@ -317,8 +326,9 @@ struct virtnet_info {
> > u8 duplex;
> > u32 speed;
> >
> > -   /* Is rx dynamic interrupt moderation enabled? */
> > +   /* Is dynamic interrupt moderation enabled? */
> > bool rx_dim_enabled;
> > +   bool tx_dim_enabled;
> >
> > /* Interrupt coalescing settings */
> > struct virtnet_interrupt_coalesce intr_coal_tx;
> > @@ -464,19 +474,40 @@ static bool virtqueue_napi_complete(struct 
> > napi_struct *napi,
> > return false;
> >  }
> >
> > +static void virtnet_tx_dim_work(struct work_struct *work);
> > +
> > +static void virtnet_tx_dim_update(struct virtnet_info *vi, struct 
> > send_queue *sq)
> > +{
> > +   struct virtnet_sq_stats *stats = &sq->stats;
> > +   struct dim_sample cur_sample = {};
> > +
> > +   u64_stats_update_begin(&sq->stats.syncp);
> > +   dim_update_sample(sq->calls, stats->packets,
> > + stats->bytes, &cur_sample);
> > +   u64_stats_update_end(&sq->stats.syncp);
> > +
> > +   net_dim(&sq->dim, cur_sample);
> > +}
> > +
> >  static void skb_xmit_done(struct virtqueue *vq)
> >  {
> > struct virtnet_info *vi = vq->vdev->priv;
> > -   struct napi_struct *napi = &vi->sq[vq2txq(vq)].napi;
> > +   struct send_queue *sq = &vi->sq[vq2txq(vq)];
> > +   struct napi_struct *napi = &sq->napi;
> > +
> > +   sq->calls++;
> 
> I wonder what the impact of this counter is for netdim, as we have an
> skb orphan mode in xmit.
> 
> We need to test to see.
> 
> Thanks

Agreed, performance patches should come with performance results.

-- 
MST


Re: [PATCH net-next 0/5] virtio-net: support dynamic coalescing moderation

2023-10-24 Thread Michael S. Tsirkin
On Thu, Oct 12, 2023 at 03:44:04PM +0800, Heng Qi wrote:
> Now, virtio-net already supports per-queue moderation parameter
> setting. Based on this, we use the netdim library of linux to support
> dynamic coalescing moderation for virtio-net.
> 
> Due to hardware scheduling issues, we only tested rx dim.

So patches 1 to 4 look ok but patch 5 is untested - we should
probably wait until it's tested properly.


> @Test env
> rxq0 has affinity to cpu0.
> 
> @Test cmd
> client: taskset -c 0 sockperf tp -i ${IP} -t 30 --tcp -m ${msg_size}
> server: taskset -c 0 sockperf sr --tcp
> 
> @Test res
> The second column is the ratio of the result returned by client
> when rx dim is enabled to the result returned by client when
> rx dim is disabled.
>   --------------------------------------
>   | msg_size |  rx_dim=on / rx_dim=off |
>   --------------------------------------
>   |   14B    |  + 3%                   |
>   --------------------------------------
>   |   100B   |  + 16%                  |
>   --------------------------------------
>   |   500B   |  + 25%                  |
>   --------------------------------------
>   |   1400B  |  + 28%                  |
>   --------------------------------------
>   |   2048B  |  + 22%                  |
>   --------------------------------------
>   |   4096B  |  + 5%                   |
>   --------------------------------------
> 
> ---
> This patch set was part of the previous netdim patch set[1].
> [1] was split into a merged bugfix set[2] and the current set.
> The previous relevant commentators have been Cced.
> 
> [1] 
> https://lore.kernel.org/all/20230811065512.22190-1-hen...@linux.alibaba.com/
> [2] https://lore.kernel.org/all/cover.1696745452.git.hen...@linux.alibaba.com/
> 
> Heng Qi (5):
>   virtio-net: returns whether napi is complete
>   virtio-net: separate rx/tx coalescing moderation cmds
>   virtio-net: extract virtqueue coalescig cmd for reuse
>   virtio-net: support rx netdim
>   virtio-net: support tx netdim
> 
>  drivers/net/virtio_net.c | 394 ---
>  1 file changed, 322 insertions(+), 72 deletions(-)
> 
> -- 
> 2.19.1.6.gb485710b



Re: [PATCH net-next 5/5] virtio-net: support tx netdim

2023-10-24 Thread Jason Wang
On Thu, Oct 12, 2023 at 3:44 PM Heng Qi  wrote:
>
> Similar to rx netdim, this patch supports adaptive tx
> coalescing moderation for the virtio-net.
>
> Signed-off-by: Heng Qi 
> ---
>  drivers/net/virtio_net.c | 143 ---
>  1 file changed, 119 insertions(+), 24 deletions(-)
>
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index 6ad2890a7909..1c680cb09d48 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -154,6 +154,15 @@ struct send_queue {
>
> struct virtnet_sq_stats stats;
>
> +   /* The number of tx notifications */
> +   u16 calls;
> +
> +   /* Is dynamic interrupt moderation enabled? */
> +   bool dim_enabled;
> +
> +   /* Dynamic Interrupt Moderation */
> +   struct dim dim;
> +
> struct virtnet_interrupt_coalesce intr_coal;
>
> struct napi_struct napi;
> @@ -317,8 +326,9 @@ struct virtnet_info {
> u8 duplex;
> u32 speed;
>
> -   /* Is rx dynamic interrupt moderation enabled? */
> +   /* Is dynamic interrupt moderation enabled? */
> bool rx_dim_enabled;
> +   bool tx_dim_enabled;
>
> /* Interrupt coalescing settings */
> struct virtnet_interrupt_coalesce intr_coal_tx;
> @@ -464,19 +474,40 @@ static bool virtqueue_napi_complete(struct napi_struct 
> *napi,
> return false;
>  }
>
> +static void virtnet_tx_dim_work(struct work_struct *work);
> +
> +static void virtnet_tx_dim_update(struct virtnet_info *vi, struct send_queue 
> *sq)
> +{
> +   struct virtnet_sq_stats *stats = &sq->stats;
> +   struct dim_sample cur_sample = {};
> +
> +   u64_stats_update_begin(&sq->stats.syncp);
> +   dim_update_sample(sq->calls, stats->packets,
> + stats->bytes, &cur_sample);
> +   u64_stats_update_end(&sq->stats.syncp);
> +
> +   net_dim(&sq->dim, cur_sample);
> +}
> +
>  static void skb_xmit_done(struct virtqueue *vq)
>  {
> struct virtnet_info *vi = vq->vdev->priv;
> -   struct napi_struct *napi = &vi->sq[vq2txq(vq)].napi;
> +   struct send_queue *sq = &vi->sq[vq2txq(vq)];
> +   struct napi_struct *napi = &sq->napi;
> +
> +   sq->calls++;

I wonder what the impact of this counter is for netdim, as we have an
skb orphan mode in xmit.

We need to test to see.

Thanks


Re: [PATCH net-next 4/5] virtio-net: support rx netdim

2023-10-24 Thread Jason Wang
On Thu, Oct 12, 2023 at 3:44 PM Heng Qi  wrote:
>
> By comparing the traffic information in the complete napi processes,
> let the virtio-net driver automatically adjust the coalescing
> moderation parameters of each receive queue.
>
> Signed-off-by: Heng Qi 
> ---
>  drivers/net/virtio_net.c | 147 +--
>  1 file changed, 126 insertions(+), 21 deletions(-)
>
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index caef78bb3963..6ad2890a7909 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -19,6 +19,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -172,6 +173,17 @@ struct receive_queue {
>
> struct virtnet_rq_stats stats;
>
> +   /* The number of rx notifications */
> +   u16 calls;
> +
> +   /* Is dynamic interrupt moderation enabled? */
> +   bool dim_enabled;
> +
> +   /* Dynamic Interrupt Moderation */
> +   struct dim dim;
> +
> +   u32 packets_in_napi;
> +
> struct virtnet_interrupt_coalesce intr_coal;
>
> /* Chain pages by the private ptr. */
> @@ -305,6 +317,9 @@ struct virtnet_info {
> u8 duplex;
> u32 speed;
>
> +   /* Is rx dynamic interrupt moderation enabled? */
> +   bool rx_dim_enabled;
> +
> /* Interrupt coalescing settings */
> struct virtnet_interrupt_coalesce intr_coal_tx;
> struct virtnet_interrupt_coalesce intr_coal_rx;
> @@ -2001,6 +2016,7 @@ static void skb_recv_done(struct virtqueue *rvq)
> struct virtnet_info *vi = rvq->vdev->priv;
> struct receive_queue *rq = &vi->rq[vq2rxq(rvq)];
>
> +   rq->calls++;
> virtqueue_napi_schedule(&rq->napi, rvq);
>  }
>
> @@ -2138,6 +2154,25 @@ static void virtnet_poll_cleantx(struct receive_queue 
> *rq)
> }
>  }
>
> +static void virtnet_rx_dim_work(struct work_struct *work);
> +
> +static void virtnet_rx_dim_update(struct virtnet_info *vi, struct 
> receive_queue *rq)
> +{
> +   struct virtnet_rq_stats *stats = &rq->stats;
> +   struct dim_sample cur_sample = {};
> +
> +   if (!rq->packets_in_napi)
> +   return;
> +
> +   u64_stats_update_begin(&rq->stats.syncp);
> +   dim_update_sample(rq->calls, stats->packets,
> + stats->bytes, &cur_sample);
> +   u64_stats_update_end(&rq->stats.syncp);
> +
> +   net_dim(&rq->dim, cur_sample);
> +   rq->packets_in_napi = 0;
> +}
> +
>  static int virtnet_poll(struct napi_struct *napi, int budget)
>  {
> struct receive_queue *rq =
> @@ -2146,17 +2181,22 @@ static int virtnet_poll(struct napi_struct *napi, int 
> budget)
> struct send_queue *sq;
> unsigned int received;
> unsigned int xdp_xmit = 0;
> +   bool napi_complete;
>
> virtnet_poll_cleantx(rq);
>
> received = virtnet_receive(rq, budget, &xdp_xmit);
> +   rq->packets_in_napi += received;
>
> if (xdp_xmit & VIRTIO_XDP_REDIR)
> xdp_do_flush();
>
> /* Out of packets? */
> -   if (received < budget)
> -   virtqueue_napi_complete(napi, rq->vq, received);
> +   if (received < budget) {
> +   napi_complete = virtqueue_napi_complete(napi, rq->vq, 
> received);
> +   if (napi_complete && rq->dim_enabled)
> +   virtnet_rx_dim_update(vi, rq);
> +   }
>
> if (xdp_xmit & VIRTIO_XDP_TX) {
> sq = virtnet_xdp_get_sq(vi);
> @@ -2176,6 +2216,7 @@ static void virtnet_disable_queue_pair(struct 
> virtnet_info *vi, int qp_index)
> virtnet_napi_tx_disable(&vi->sq[qp_index].napi);
> napi_disable(&vi->rq[qp_index].napi);
> xdp_rxq_info_unreg(&vi->rq[qp_index].xdp_rxq);
> +   cancel_work_sync(&vi->rq[qp_index].dim.work);
>  }
>
>  static int virtnet_enable_queue_pair(struct virtnet_info *vi, int qp_index)
> @@ -2193,6 +2234,9 @@ static int virtnet_enable_queue_pair(struct 
> virtnet_info *vi, int qp_index)
> if (err < 0)
> goto err_xdp_reg_mem_model;
>
> +   INIT_WORK(&vi->rq[qp_index].dim.work, virtnet_rx_dim_work);
> +   vi->rq[qp_index].dim.mode = DIM_CQ_PERIOD_MODE_START_FROM_EQE;
> +
> virtnet_napi_enable(vi->rq[qp_index].vq, &vi->rq[qp_index].napi);
> virtnet_napi_tx_enable(vi, vi->sq[qp_index].vq, 
> &vi->sq[qp_index].napi);
>
> @@ -3335,23 +3379,42 @@ static int virtnet_send_tx_notf_coal_cmds(struct 
> virtnet_info *vi,
>  static int virtnet_send_rx_notf_coal_cmds(struct virtnet_info *vi,
>   struct ethtool_coalesce *ec)
>  {
> +   bool rx_ctrl_dim_on = !!ec->use_adaptive_rx_coalesce;
> struct scatterlist sgs_rx;
> +   int i;
>
> -   vi->ctrl->coal_rx.rx_usecs = cpu_to_le32(ec->rx_coalesce_usecs);
> -   vi->ctrl->coal_rx.rx_max_packets = 
> cpu_to_le32(ec->rx_max_coalesced_frames);
> -   sg_init_one(&sgs_rx, &vi->ctrl->coal_
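
For context, the use_adaptive_rx_coalesce flag consumed above is what userspace flips through ethtool's adaptive coalescing knob; a usage sketch, assuming a virtio-net interface named eth0:

  ethtool -C eth0 adaptive-rx on   # sets use_adaptive_rx_coalesce
  ethtool -c eth0                  # "Adaptive RX: on" confirms the setting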

Re: [PATCH net-next 3/5] virtio-net: extract virtqueue coalescig cmd for reuse

2023-10-24 Thread Jason Wang
On Thu, Oct 12, 2023 at 3:44 PM Heng Qi  wrote:
>
> Extract commands to set virtqueue coalescing parameters for reuse
> by ethtool -Q, vq resize and netdim.
>
> Signed-off-by: Heng Qi 

Acked-by: Jason Wang 

Thanks

> ---
>  drivers/net/virtio_net.c | 106 +++
>  1 file changed, 64 insertions(+), 42 deletions(-)
>
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index 54b3fb8d0384..caef78bb3963 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -2846,6 +2846,58 @@ static void virtnet_cpu_notif_remove(struct 
> virtnet_info *vi)
> &vi->node_dead);
>  }
>
> +static int virtnet_send_ctrl_coal_vq_cmd(struct virtnet_info *vi,
> +u16 vqn, u32 max_usecs, u32 
> max_packets)
> +{
> +   struct scatterlist sgs;
> +
> +   vi->ctrl->coal_vq.vqn = cpu_to_le16(vqn);
> +   vi->ctrl->coal_vq.coal.max_usecs = cpu_to_le32(max_usecs);
> +   vi->ctrl->coal_vq.coal.max_packets = cpu_to_le32(max_packets);
> +   sg_init_one(&sgs, &vi->ctrl->coal_vq, sizeof(vi->ctrl->coal_vq));
> +
> +   if (!virtnet_send_command(vi, VIRTIO_NET_CTRL_NOTF_COAL,
> + VIRTIO_NET_CTRL_NOTF_COAL_VQ_SET,
> + &sgs))
> +   return -EINVAL;
> +
> +   return 0;
> +}
> +
> +static int virtnet_send_rx_ctrl_coal_vq_cmd(struct virtnet_info *vi,
> +   u16 queue, u32 max_usecs,
> +   u32 max_packets)
> +{
> +   int err;
> +
> +   err = virtnet_send_ctrl_coal_vq_cmd(vi, rxq2vq(queue),
> +   max_usecs, max_packets);
> +   if (err)
> +   return err;
> +
> +   vi->rq[queue].intr_coal.max_usecs = max_usecs;
> +   vi->rq[queue].intr_coal.max_packets = max_packets;
> +
> +   return 0;
> +}
> +
> +static int virtnet_send_tx_ctrl_coal_vq_cmd(struct virtnet_info *vi,
> +   u16 queue, u32 max_usecs,
> +   u32 max_packets)
> +{
> +   int err;
> +
> +   err = virtnet_send_ctrl_coal_vq_cmd(vi, txq2vq(queue),
> +   max_usecs, max_packets);
> +   if (err)
> +   return err;
> +
> +   vi->sq[queue].intr_coal.max_usecs = max_usecs;
> +   vi->sq[queue].intr_coal.max_packets = max_packets;
> +
> +   return 0;
> +}
> +
>  static void virtnet_get_ringparam(struct net_device *dev,
>   struct ethtool_ringparam *ring,
>   struct kernel_ethtool_ringparam 
> *kernel_ring,
> @@ -2903,14 +2955,11 @@ static int virtnet_set_ringparam(struct net_device 
> *dev,
>  * through the VIRTIO_NET_CTRL_NOTF_COAL_TX_SET 
> command, or, if the driver
>  * did not set any TX coalescing parameters, to 0.
>  */
> -   err = virtnet_send_ctrl_coal_vq_cmd(vi, txq2vq(i),
> -   
> vi->intr_coal_tx.max_usecs,
> -   
> vi->intr_coal_tx.max_packets);
> +   err = virtnet_send_tx_ctrl_coal_vq_cmd(vi, i,
> +  
> vi->intr_coal_tx.max_usecs,
> +  
> vi->intr_coal_tx.max_packets);
> if (err)
> return err;
> -
> -   vi->sq[i].intr_coal.max_usecs = 
> vi->intr_coal_tx.max_usecs;
> -   vi->sq[i].intr_coal.max_packets = 
> vi->intr_coal_tx.max_packets;
> }
>
> if (ring->rx_pending != rx_pending) {
> @@ -2919,14 +2968,11 @@ static int virtnet_set_ringparam(struct net_device 
> *dev,
> return err;
>
> /* The reason is same as the transmit virtqueue reset 
> */
> -   err = virtnet_send_ctrl_coal_vq_cmd(vi, rxq2vq(i),
> -   
> vi->intr_coal_rx.max_usecs,
> -   
> vi->intr_coal_rx.max_packets);
> +   err = virtnet_send_rx_ctrl_coal_vq_cmd(vi, i,
> +  
> vi->intr_coal_rx.max_usecs,
> +  
> vi->intr_coal_rx.max_packets);
> if (err)
> return err;
> -
> -   vi->rq[i].intr_coal.max_usecs = 
> vi->intr_coal_rx.max_usecs;
> -   vi->rq[i].intr_coal.max_packets = 
> vi->intr_coal_rx.max_packets;
> }
> }
>
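
To see how the later netdim patches consume these helpers, the rx dim work handler ends up looking roughly like the sketch below (based on the generic net_dim API from <linux/dim.h>; the exact hunk in patch 4/5 may differ, e.g. around locking of the ctrl virtqueue):

  static void virtnet_rx_dim_work(struct work_struct *work)
  {
          struct dim *dim = container_of(work, struct dim, work);
          struct receive_queue *rq = container_of(dim,
                          struct receive_queue, dim);
          struct virtnet_info *vi = rq->vq->vdev->priv;
          struct dim_cq_moder update;
          int qnum = rq - vi->rq, err;

          /* translate the profile netdim picked into usecs/packets */
          update = net_dim_get_rx_moderation(dim->mode, dim->profile_ix);

          /* push it to the device via the helper introduced above */
          err = virtnet_send_rx_ctrl_coal_vq_cmd(vi, qnum,
                                                 update.usec, update.pkts);
          if (err)
                  pr_debug("%s: failed to update rx coalescing on rxq%d\n",
                           vi->dev->name, qnum);

          dim->state = DIM_START_MEASURE;
  }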

Re: [PATCH net-next 1/5] virtio-net: returns whether napi is complete

2023-10-24 Thread Jason Wang
On Thu, Oct 12, 2023 at 3:44 PM Heng Qi  wrote:
>
> rx netdim needs to count the traffic during a complete napi process,
> and start updating and comparing samples to make decisions after
> the napi ends. Let virtqueue_napi_complete() return true if napi is done,
> and false otherwise.
>
> Signed-off-by: Heng Qi 

Acked-by: Jason Wang 

Thanks

> ---
>  drivers/net/virtio_net.c | 6 +-
>  1 file changed, 5 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index a52fd743504a..cf5d2ef4bd24 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -431,7 +431,7 @@ static void virtqueue_napi_schedule(struct napi_struct 
> *napi,
> }
>  }
>
> -static void virtqueue_napi_complete(struct napi_struct *napi,
> +static bool virtqueue_napi_complete(struct napi_struct *napi,
> struct virtqueue *vq, int processed)
>  {
> int opaque;
> @@ -440,9 +440,13 @@ static void virtqueue_napi_complete(struct napi_struct 
> *napi,
> if (napi_complete_done(napi, processed)) {
> if (unlikely(virtqueue_poll(vq, opaque)))
> virtqueue_napi_schedule(napi, vq);
> +   else
> +   return true;
> } else {
> virtqueue_disable_cb(vq);
> }
> +
> +   return false;
>  }
>
>  static void skb_xmit_done(struct virtqueue *vq)
> --
> 2.19.1.6.gb485710b
>


Re: [PATCH net-next 0/5] virtio-net: support dynamic coalescing moderation

2023-10-24 Thread Jason Wang
On Tue, Oct 24, 2023 at 8:03 PM Heng Qi  wrote:
>
>
>
> On 2023/10/12 at 4:29 PM, Jason Wang wrote:
> > On Thu, Oct 12, 2023 at 3:44 PM Heng Qi  wrote:
> >> Now, virtio-net already supports per-queue moderation parameter
> >> setting. Based on this, we use the netdim library of linux to support
> >> dynamic coalescing moderation for virtio-net.
> >>
> >> Due to hardware scheduling issues, we only tested rx dim.
> > Do you have PPS numbers? And TX numbers are also important as the
> > throughput could be misleading due to various reasons.
>
> Hi Jason!
>
> The comparison of rx netdim performance is as follows:
> (the backend supporting tx dim is not yet ready)

Thanks a lot for the numbers.

I'd still expect the TX results, as I did play with tx interrupt coalescing
about 10 years ago.

I will start to review the series, but let's try to have some TX numbers as well.

Btw, it would be more convenient to have a raw PPS benchmark. E.g. you
can try to use a software or hardware packet generator.

Thanks
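
For a raw PPS run, the in-kernel pktgen module is one option; a minimal sender-side sketch (the interface name, destination address/MAC and thread binding are assumptions that have to match the actual test box):

  modprobe pktgen
  echo "rem_device_all"  > /proc/net/pktgen/kpktgend_0
  echo "add_device eth0" > /proc/net/pktgen/kpktgend_0
  echo "count 0"         > /proc/net/pktgen/eth0   # run until stopped
  echo "pkt_size 60"     > /proc/net/pktgen/eth0   # minimal frames for max PPS
  echo "dst 192.168.1.2" > /proc/net/pktgen/eth0
  echo "dst_mac 52:54:00:12:34:56" > /proc/net/pktgen/eth0
  echo "start"           > /proc/net/pktgen/pgctrl

The receive-side PPS can then be read from the per-queue counters (e.g. ethtool -S) on the machine running rx dim, with dim toggled on and off.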

>
>
> I. Sockperf UDP
> =
> 1. Env
> rxq_0 has affinity to cpu_0
>
> 2. Cmd
> client:  taskset -c 0 sockperf tp -p 8989 -i $IP -t 10 -m 16B
> server: taskset -c 0 sockperf sr -p 8989
>
> 3. Result
> dim off: 1143277.00 rxpps, throughput 17.844 MBps, cpu is 100%.
> dim on: 1124161.00 rxpps, throughput 17.610 MBps, cpu is 83.5%.
> =
>
>
> II. Redis
> =
> 1. Env
> There are 8 rxqs and rxq_i has affinity to cpu_i.
>
> 2. Result
> When all cpus are 100%, ops/sec of memtier_benchmark client is
> dim off:   978437.23
> dim on: 1143638.28
> =
>
>
> III. Nginx
> =
> 1. Env
> There are 8 rxqs and rxq_i has affinity to cpu_i.
>
> 2. Result
> When all cpus are 100%, requests/sec of wrk client is
> dim off:   877931.67
> dim on: 1019160.31
> =
>
> Thanks!
>
> >
> > Thanks
> >
> >> @Test env
> >> rxq0 has affinity to cpu0.
> >>
> >> @Test cmd
> >> client: taskset -c 0 sockperf tp -i ${IP} -t 30 --tcp -m ${msg_size}
> >> server: taskset -c 0 sockperf sr --tcp
> >>
> >> @Test res
> >> The second column is the ratio of the result returned by client
> >> when rx dim is enabled to the result returned by client when
> >> rx dim is disabled.
> >>  --------------------------------------
> >>  | msg_size |  rx_dim=on / rx_dim=off |
> >>  --------------------------------------
> >>  |   14B    |  + 3%                   |
> >>  --------------------------------------
> >>  |   100B   |  + 16%                  |
> >>  --------------------------------------
> >>  |   500B   |  + 25%                  |
> >>  --------------------------------------
> >>  |   1400B  |  + 28%                  |
> >>  --------------------------------------
> >>  |   2048B  |  + 22%                  |
> >>  --------------------------------------
> >>  |   4096B  |  + 5%                   |
> >>  --------------------------------------
> >>
> >> ---
> >> This patch set was part of the previous netdim patch set[1].
> >> [1] was split into a merged bugfix set[2] and the current set.
> >> The previous relevant commentators have been Cced.
> >>
> >> [1] 
> >> https://lore.kernel.org/all/20230811065512.22190-1-hen...@linux.alibaba.com/
> >> [2] 
> >> https://lore.kernel.org/all/cover.1696745452.git.hen...@linux.alibaba.com/
> >>
> >> Heng Qi (5):
> >>virtio-net: returns whether napi is complete
> >>virtio-net: separate rx/tx coalescing moderation cmds
> >>virtio-net: extract virtqueue coalescig cmd for reuse
> >>virtio-net: support rx netdim
> >>virtio-net: support tx netdim
> >>
> >>   drivers/net/virtio_net.c | 394 ---
> >>   1 file changed, 322 insertions(+), 72 deletions(-)
> >>
> >> --
> >> 2.19.1.6.gb485710b
> >>
> >>
>
>


Re: [PATCH v4 5/7] vhost-vdpa: clean iotlb map during reset for older userspace

2023-10-24 Thread Jason Wang
On Wed, Oct 25, 2023 at 12:25 AM Si-Wei Liu  wrote:
>
>
>
> On 10/24/2023 9:21 AM, Si-Wei Liu wrote:
> >
> >
> > On 10/23/2023 10:45 PM, Jason Wang wrote:
> >> On Sat, Oct 21, 2023 at 5:28 PM Si-Wei Liu 
> >> wrote:
> >>> Using .compat_reset op from the previous patch, the buggy .reset
> >>> behaviour can be kept as-is on older userspace apps, which don't ack
> >>> the
> >>> IOTLB_PERSIST backend feature. As this compatibility quirk is
> >>> limited to
> >>> those drivers that used to be buggy in the past, it won't affect change
> >>> the behaviour or affect ABI on the setups with API compliant driver.
> >>>
> >>> The separation of .compat_reset from the regular .reset allows
> >>> vhost-vdpa able to know which driver had broken behaviour before, so it
> >>> can apply the corresponding compatibility quirk to the individual
> >>> driver
> >>> whenever needed.  Compared to overloading the existing .reset with
> >>> flags, .compat_reset won't cause any extra burden to the implementation
> >>> of every compliant driver.
> >>>
> >>> Signed-off-by: Si-Wei Liu 
> >>> ---
> >>>   drivers/vhost/vdpa.c | 17 +
> >>>   drivers/virtio/virtio_vdpa.c |  2 +-
> >>>   include/linux/vdpa.h |  7 +--
> >>>   3 files changed, 19 insertions(+), 7 deletions(-)
> >>>
> >>> diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
> >>> index acc7c74ba7d6..9ce40003793b 100644
> >>> --- a/drivers/vhost/vdpa.c
> >>> +++ b/drivers/vhost/vdpa.c
> >>> @@ -227,13 +227,22 @@ static void vhost_vdpa_unsetup_vq_irq(struct
> >>> vhost_vdpa *v, u16 qid)
> >>> irq_bypass_unregister_producer(&vq->call_ctx.producer);
> >>>   }
> >>>
> >>> -static int vhost_vdpa_reset(struct vhost_vdpa *v)
> >>> +static int _compat_vdpa_reset(struct vhost_vdpa *v)
> >>>   {
> >>>  struct vdpa_device *vdpa = v->vdpa;
> >>> +   u32 flags = 0;
> >>>
> >>> -   v->in_batch = 0;
> >>> +   flags |= !vhost_backend_has_feature(v->vdev.vqs[0],
> >>> + VHOST_BACKEND_F_IOTLB_PERSIST) ?
> >>> +VDPA_RESET_F_CLEAN_MAP : 0;
> >>> +
> >>> +   return vdpa_reset(vdpa, flags);
> >>> +}
> >>>
> >>> -   return vdpa_reset(vdpa);
> >>> +static int vhost_vdpa_reset(struct vhost_vdpa *v)
> >>> +{
> >>> +   v->in_batch = 0;
> >>> +   return _compat_vdpa_reset(v);
> >>>   }
> >>>
> >>>   static long vhost_vdpa_bind_mm(struct vhost_vdpa *v)
> >>> @@ -312,7 +321,7 @@ static long vhost_vdpa_set_status(struct
> >>> vhost_vdpa *v, u8 __user *statusp)
> >>>  vhost_vdpa_unsetup_vq_irq(v, i);
> >>>
> >>>  if (status == 0) {
> >>> -   ret = vdpa_reset(vdpa);
> >>> +   ret = _compat_vdpa_reset(v);
> >>>  if (ret)
> >>>  return ret;
> >>>  } else
> >>> diff --git a/drivers/virtio/virtio_vdpa.c
> >>> b/drivers/virtio/virtio_vdpa.c
> >>> index 06ce6d8c2e00..8d63e5923d24 100644
> >>> --- a/drivers/virtio/virtio_vdpa.c
> >>> +++ b/drivers/virtio/virtio_vdpa.c
> >>> @@ -100,7 +100,7 @@ static void virtio_vdpa_reset(struct
> >>> virtio_device *vdev)
> >>>   {
> >>>  struct vdpa_device *vdpa = vd_get_vdpa(vdev);
> >>>
> >>> -   vdpa_reset(vdpa);
> >>> +   vdpa_reset(vdpa, 0);
> >>>   }
> >>>
> >>>   static bool virtio_vdpa_notify(struct virtqueue *vq)
> >>> diff --git a/include/linux/vdpa.h b/include/linux/vdpa.h
> >>> index 6b8cbf75712d..db15ac07f8a6 100644
> >>> --- a/include/linux/vdpa.h
> >>> +++ b/include/linux/vdpa.h
> >>> @@ -519,14 +519,17 @@ static inline struct device
> >>> *vdpa_get_dma_dev(struct vdpa_device *vdev)
> >>>  return vdev->dma_dev;
> >>>   }
> >>>
> >>> -static inline int vdpa_reset(struct vdpa_device *vdev)
> >>> +static inline int vdpa_reset(struct vdpa_device *vdev, u32 flags)
> >>>   {
> >>>  const struct vdpa_config_ops *ops = vdev->config;
> >>>  int ret;
> >>>
> >>>  down_write(&vdev->cf_lock);
> >>>  vdev->features_valid = false;
> >>> -   ret = ops->reset(vdev);
> >>> +   if (ops->compat_reset && flags)
> >>> +   ret = ops->compat_reset(vdev, flags);
> >>> +   else
> >>> +   ret = ops->reset(vdev);
> >> Instead of inventing a new API that carries the flags. Tweak the
> >> existing one seems to be simpler and better?
> > Well, as indicated in the commit message, this allows vhost-vdpa to
> > know which driver had broken behavior before, so it
> > can apply the corresponding compatibility quirk to the individual
> > driver when it's really necessary. If sending all flags
> > unconditionally down to every driver,

It depends on whether IOTLB_PERSIST is set.

> it's hard for driver writers to
> > distinguish which are compatibility quirks that they can safely ignore
> > and which are feature flags that are encouraged to implement. In that
> > sense, gating features from being polluted by compatibility quirks
> > with an implicit op
> s/implicit/explicit/
> > would be better.

Both of us have th

Re: Bug#1054514: linux-image-6.1.0-13-amd64: Debian VM with qxl graphics freezes frequently

2023-10-24 Thread Salvatore Bonaccorso
Hi Timo,

On Tue, Oct 24, 2023 at 11:14:32PM +0300, Timo Lindfors wrote:
> Package: src:linux
> Version: 6.1.55-1
> Severity: normal
> 
> Steps to reproduce:
> 1) Install Debian 12 as a virtual machine using virt-manager, choose qxl
>graphics card. You only need basic installation without wayland or X.
> 2) Log in from the console and save the following to reproduce.bash:
> 
> #!/bin/bash
> 
> chvt 3
> for j in $(seq 80); do
> echo "$(date) starting round $j"
> if [ "$(journalctl --boot | grep "failed to allocate VRAM BO")" != "" ];
> then
> echo "bug was reproduced after $j tries"
> exit 1
> fi
> for i in $(seq 100); do
> dmesg > /dev/tty3
> done
> done
> 
> echo "bug could not be reproduced"
> exit 0
> 
> 
> 3) Run chmod a+x reproduce.bash
> 4) Run ./reproduce.bash and wait for up to 20 minutes.
> 
> Expected results:
> 4) The system prints a steady flow of text without kernel error messages
> 
> Actual messages:
> 4) At some point the text stops flowing and the script prints "bug was
>reproduced". If you run "journalctl --boot" you see
> 
> kernel: [TTM] Buffer eviction failed
> kernel: qxl :00:02.0: object_init failed for (3149824, 0x0001)
> kernel: [drm:qxl_alloc_bo_reserved [qxl]] *ERROR* failed to allocate VRAM BO
> 
> 
> 
> More info:
> 1) The bug does not occur if I downgrade the kernel to
>linux-image-5.10.0-26-amd64_5.10.197-1_amd64.deb from Debian 11.
> 2) I used the following test_linux.bash to bisect this issue against
>upstream source:
> 
> #!/bin/bash
> set -x
> 
> gitversion="$(git describe HEAD|sed 's@^v@@')"
> 
> git checkout drivers/gpu/drm/ttm/ttm_bo.c include/drm/ttm/ttm_bo_api.h
> git show bec771b5e0901f4b0bc861bcb58056de5151ae3a | patch -p1
> # Build
> cp ~/kernel.config .config
> # cp /boot/config-$(uname -r) .config
> # scripts/config --enable LOCALVERSION_AUTO
> # scripts/config --disable DEBUG_INFO
> # scripts/config --disable SYSTEM_TRUSTED_KEYRING
> # scripts/config --set-str SYSTEM_TRUSTED_KEYS ''
> # scripts/config --disable STACKPROTECTOR_STRONG
> make olddefconfig
> # make localmodconfig
> make -j$(nproc --all) bindeb-pkg
> rc="$?"
> if [ "$rc" != "0" ]; then
> exit 125
> fi
> git checkout drivers/gpu/drm/ttm/ttm_bo.c include/drm/ttm/ttm_bo_api.h
> 
> package="$(ls --sort=time ../linux-image-*_amd64.deb|head -n1)"
> version=$(echo $package | cut -d_ -f1|cut -d- -f3-)
> 
> if [ "$gitversion" != "$version" ]; then
> echo "Build produced version $gitversion but got $version, ignoring"
> #exit 255
> fi
> 
> # Deploy
> scp $package target:a.deb
> ssh target sudo apt install ./a.deb
> ssh target rm -f a.deb
> ssh target ./grub_set_default_version.bash $version
> ssh target sudo shutdown -r now
> sleep 40
> 
> detected_version=$(ssh target uname -r)
> if [ "$detected_version" != "$version" ]; then
> echo "Booted to $detected_version but expected $version"
> exit 255
> fi
> 
> # Test
> exec ssh target sudo ./reproduce.bash
> 
> 
> Bisect printed the following log:
> 
> git bisect start
> # bad: [ed29c2691188cf7ea2a46d40b891836c2bd1a4f5] drm/i915: Fix userptr so we 
> do not have to worry about obj->mm.lock, v7.
> git bisect bad ed29c2691188cf7ea2a46d40b891836c2bd1a4f5
> # bad: [762949bb1da78941b25e63f7e952af037eee15a9] drm: fix 
> drm_mode_create_blob comment
> git bisect bad 762949bb1da78941b25e63f7e952af037eee15a9
> # bad: [e40f97ef12772f8eb04b6a155baa1e0e2e8f3ecc] drm/gma500: Drop DRM_GMA600 
> config option
> git bisect bad e40f97ef12772f8eb04b6a155baa1e0e2e8f3ecc
> # bad: [5a838e5d5825c85556011478abde708251cc0776] drm/qxl: simplify 
> qxl_fence_wait
> git bisect bad 5a838e5d5825c85556011478abde708251cc0776
> # bad: [d2b6f8a179194de0ffc4886ffc2c4358d86047b8] Merge tag 
> 'xfs-5.13-merge-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
> git bisect bad d2b6f8a179194de0ffc4886ffc2c4358d86047b8
> # bad: [68a32ba14177d4a21c4a9a941cf1d7aea86d436f] Merge tag 
> 'drm-next-2021-04-28' of git://anongit.freedesktop.org/drm/drm
> git bisect bad 68a32ba14177d4a21c4a9a941cf1d7aea86d436f
> # bad: [0698b13403788a646073fcd9b2294f2dce0ce429] drm/amdgpu: skip 
> PP_MP1_STATE_UNLOAD on aldebaran
> git bisect bad 0698b13403788a646073fcd9b2294f2dce0ce429
> # bad: [e1a5e6a8c48bf99ea374fb3e535661cfe226bca4] drm/doc: Add RFC section
> git bisect bad e1a5e6a8c48bf99ea374fb3e535661cfe226bca4
> # bad: [ed29c2691188cf7ea2a46d40b891836c2bd1a4f5] drm/i915: Fix userptr so we 
> do not have to worry about obj->mm.lock, v7.
> git bisect bad ed29c2691188cf7ea2a46d40b891836c2bd1a4f5
> # bad: [2c8ab3339e398bbbcb0980933e266b93bedaae52] drm/i915: Pin timeline map 
> after first timeline pin, v4.
> git bisect bad 2c8ab3339e398bbbcb0980933e266b93bedaae52
> # bad: [2eb8e1a69d9f8cc9c0a75e327f854957224ba421] drm/i915/gem: Drop 
> relocation support on all new hardware (v6)
> git bisect bad 2eb8e1a69d9f8cc9c0a75e327f854957224ba421
> # bad: [b5b6f6a610127b17f20c0ca03dd27beee4ddc2b2] drm/i915/gem: Drop legacy 
> execbuffer support (v

Re: [PATCH] virtio_console: Annotate struct port_buffer with __counted_by

2023-10-24 Thread Kees Cook
On Fri, 22 Sep 2023 10:51:15 -0700, Kees Cook wrote:
> Prepare for the coming implementation by GCC and Clang of the __counted_by
> attribute. Flexible array members annotated with __counted_by can have
> their accesses bounds-checked at run-time checking via CONFIG_UBSAN_BOUNDS
> (for array indexing) and CONFIG_FORTIFY_SOURCE (for strcpy/memcpy-family
> functions).
> 
> As found with Coccinelle[1], add __counted_by for struct port_buffer.
> 
> [...]

Applied to for-next/hardening, thanks!

[1/1] virtio_console: Annotate struct port_buffer with __counted_by
  https://git.kernel.org/kees/c/bf5abc17bc43

Take care,

-- 
Kees Cook
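
For readers following along, what the annotation expresses is roughly this (a sketch with hypothetical struct and field names, not the actual port_buffer layout):

  struct example_buf {
          size_t count;                    /* number of elements in data[] */
          u8 data[] __counted_by(count);   /* flexible array, bounds-checkable */
  };

With CONFIG_UBSAN_BOUNDS and CONFIG_FORTIFY_SOURCE enabled, accesses beyond 'count' elements can then be caught at run time, as the quoted commit message describes.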



Re: [PATCH V1 vfio 6/9] virtio-pci: Introduce APIs to execute legacy IO admin commands

2023-10-24 Thread Michael S. Tsirkin
On Tue, Oct 17, 2023 at 04:42:14PM +0300, Yishai Hadas wrote:
> Introduce APIs to execute legacy IO admin commands.
> 
> It includes: list_query/use, io_legacy_read/write,
> io_legacy_notify_info.
> 
> Those APIs will be used by the next patches from this series.
> 
> Signed-off-by: Yishai Hadas 
> ---
>  drivers/virtio/virtio_pci_common.c |  11 ++
>  drivers/virtio/virtio_pci_common.h |   2 +
>  drivers/virtio/virtio_pci_modern.c | 206 +
>  include/linux/virtio_pci_admin.h   |  18 +++
>  4 files changed, 237 insertions(+)
>  create mode 100644 include/linux/virtio_pci_admin.h
> 
> diff --git a/drivers/virtio/virtio_pci_common.c 
> b/drivers/virtio/virtio_pci_common.c
> index 6b4766d5abe6..212d68401d2c 100644
> --- a/drivers/virtio/virtio_pci_common.c
> +++ b/drivers/virtio/virtio_pci_common.c
> @@ -645,6 +645,17 @@ static struct pci_driver virtio_pci_driver = {
>   .sriov_configure = virtio_pci_sriov_configure,
>  };
>  
> +struct virtio_device *virtio_pci_vf_get_pf_dev(struct pci_dev *pdev)
> +{
> + struct virtio_pci_device *pf_vp_dev;
> +
> + pf_vp_dev = pci_iov_get_pf_drvdata(pdev, &virtio_pci_driver);
> + if (IS_ERR(pf_vp_dev))
> + return NULL;
> +
> + return &pf_vp_dev->vdev;
> +}
> +
>  module_pci_driver(virtio_pci_driver);
>  
>  MODULE_AUTHOR("Anthony Liguori ");
> diff --git a/drivers/virtio/virtio_pci_common.h 
> b/drivers/virtio/virtio_pci_common.h
> index a21b9ba01a60..2785e61ed668 100644
> --- a/drivers/virtio/virtio_pci_common.h
> +++ b/drivers/virtio/virtio_pci_common.h
> @@ -155,4 +155,6 @@ static inline void virtio_pci_legacy_remove(struct 
> virtio_pci_device *vp_dev)
>  int virtio_pci_modern_probe(struct virtio_pci_device *);
>  void virtio_pci_modern_remove(struct virtio_pci_device *);
>  
> +struct virtio_device *virtio_pci_vf_get_pf_dev(struct pci_dev *pdev);
> +
>  #endif
> diff --git a/drivers/virtio/virtio_pci_modern.c 
> b/drivers/virtio/virtio_pci_modern.c
> index cc159a8e6c70..00b65e20b2f5 100644
> --- a/drivers/virtio/virtio_pci_modern.c
> +++ b/drivers/virtio/virtio_pci_modern.c
> @@ -719,6 +719,212 @@ static void vp_modern_destroy_avq(struct virtio_device 
> *vdev)
>   vp_dev->del_vq(&vp_dev->admin_vq.info);
>  }
>  
> +/*
> + * virtio_pci_admin_list_query - Provides to driver list of commands
> + * supported for the PCI VF.
> + * @dev: VF pci_dev
> + * @buf: buffer to hold the returned list
> + * @buf_size: size of the given buffer
> + *
> + * Returns 0 on success, or negative on failure.
> + */
> +int virtio_pci_admin_list_query(struct pci_dev *pdev, u8 *buf, int buf_size)
> +{
> + struct virtio_device *virtio_dev = virtio_pci_vf_get_pf_dev(pdev);
> + struct virtio_admin_cmd cmd = {};
> + struct scatterlist result_sg;
> +
> + if (!virtio_dev)
> + return -ENODEV;
> +
> + sg_init_one(&result_sg, buf, buf_size);
> + cmd.opcode = cpu_to_le16(VIRTIO_ADMIN_CMD_LIST_QUERY);
> + cmd.group_type = cpu_to_le16(VIRTIO_ADMIN_GROUP_TYPE_SRIOV);
> + cmd.result_sg = &result_sg;
> +
> + return vp_modern_admin_cmd_exec(virtio_dev, &cmd);
> +}
> +EXPORT_SYMBOL_GPL(virtio_pci_admin_list_query);
> +
> +/*
> + * virtio_pci_admin_list_use - Provides to device list of commands
> + * used for the PCI VF.
> + * @dev: VF pci_dev
> + * @buf: buffer which holds the list
> + * @buf_size: size of the given buffer
> + *
> + * Returns 0 on success, or negative on failure.
> + */
> +int virtio_pci_admin_list_use(struct pci_dev *pdev, u8 *buf, int buf_size)
> +{
> + struct virtio_device *virtio_dev = virtio_pci_vf_get_pf_dev(pdev);
> + struct virtio_admin_cmd cmd = {};
> + struct scatterlist data_sg;
> +
> + if (!virtio_dev)
> + return -ENODEV;
> +
> + sg_init_one(&data_sg, buf, buf_size);
> + cmd.opcode = cpu_to_le16(VIRTIO_ADMIN_CMD_LIST_USE);
> + cmd.group_type = cpu_to_le16(VIRTIO_ADMIN_GROUP_TYPE_SRIOV);
> + cmd.data_sg = &data_sg;
> +
> + return vp_modern_admin_cmd_exec(virtio_dev, &cmd);
> +}
> +EXPORT_SYMBOL_GPL(virtio_pci_admin_list_use);

list commands are actually for a group, not for the VF.

> +
> +/*
> + * virtio_pci_admin_legacy_io_write - Write legacy registers of a member 
> device
> + * @dev: VF pci_dev
> + * @opcode: op code of the io write command

opcode is actually either VIRTIO_ADMIN_CMD_LEGACY_COMMON_CFG_WRITE
or VIRTIO_ADMIN_CMD_LEGACY_DEV_CFG_WRITE correct?

So please just add 2 APIs for this so users don't need to care.
Could be wrappers around these two things.
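
In code, that suggestion amounts to something like the sketch below; the wrapper names are made up here, only the opcodes and the underlying function come from this patch:

  int virtio_pci_admin_legacy_common_io_write(struct pci_dev *pdev,
                                              u8 offset, u8 size, u8 *buf)
  {
          return virtio_pci_admin_legacy_io_write(pdev,
                          VIRTIO_ADMIN_CMD_LEGACY_COMMON_CFG_WRITE,
                          offset, size, buf);
  }

  int virtio_pci_admin_legacy_device_io_write(struct pci_dev *pdev,
                                              u8 offset, u8 size, u8 *buf)
  {
          return virtio_pci_admin_legacy_io_write(pdev,
                          VIRTIO_ADMIN_CMD_LEGACY_DEV_CFG_WRITE,
                          offset, size, buf);
  }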




> + * @offset: starting byte offset within the registers to write to
> + * @size: size of the data to write
> + * @buf: buffer which holds the data
> + *
> + * Returns 0 on success, or negative on failure.
> + */
> +int virtio_pci_admin_legacy_io_write(struct pci_dev *pdev, u16 opcode,
> +  u8 offset, u8 size, u8 *buf)
> +{
> + struct virtio_device *virtio_dev = virtio_pci_vf_get_pf_dev(pdev);
> + struct virtio_admin_cmd_l

Re: [PATCH V1 vfio 9/9] vfio/virtio: Introduce a vfio driver over virtio devices

2023-10-24 Thread Alex Williamson
On Tue, 17 Oct 2023 16:42:17 +0300
Yishai Hadas  wrote:

> Introduce a vfio driver over virtio devices to support the legacy
> interface functionality for VFs.
> 
> Background, from the virtio spec [1].
> 
> In some systems, there is a need to support a virtio legacy driver with
> a device that does not directly support the legacy interface. In such
> scenarios, a group owner device can provide the legacy interface
> functionality for the group member devices. The driver of the owner
> device can then access the legacy interface of a member device on behalf
> of the legacy member device driver.
> 
> For example, with the SR-IOV group type, group members (VFs) can not
> present the legacy interface in an I/O BAR in BAR0 as expected by the
> legacy pci driver. If the legacy driver is running inside a virtual
> machine, the hypervisor executing the virtual machine can present a
> virtual device with an I/O BAR in BAR0. The hypervisor intercepts the
> legacy driver accesses to this I/O BAR and forwards them to the group
> owner device (PF) using group administration commands.
> 
> 
> Specifically, this driver adds support for a virtio-net VF to be exposed
> as a transitional device to a guest driver and allows the legacy IO BAR
> functionality on top.
> 
> This allows a VM which uses a legacy virtio-net driver in the guest to
> work transparently over a VF which its driver in the host is that new
> driver.
> 
> The driver can be extended easily to support some other types of virtio
> devices (e.g virtio-blk), by adding in a few places the specific type
> properties as was done for virtio-net.
> 
> For now, only the virtio-net use case was tested and as such we introduce
> the support only for such a device.
> 
> Practically,
> Upon probing a VF for a virtio-net device, in case its PF supports
> legacy access over the virtio admin commands and the VF doesn't have BAR
> 0, we set some specific 'vfio_device_ops' to be able to simulate in SW a
> transitional device with I/O BAR in BAR 0.
> 
> The existence of the simulated I/O bar is reported later on by
> overwriting the VFIO_DEVICE_GET_REGION_INFO command and the device
> exposes itself as a transitional device by overwriting some properties
> upon reading its config space.
> 
> Once we report the existence of I/O BAR as BAR 0 a legacy driver in the
> guest may use it via read/write calls according to the virtio
> specification.
> 
> Any read/write towards the control parts of the BAR will be captured by
> the new driver and will be translated into admin commands towards the
> device.
> 
> Any data path read/write access (i.e. virtio driver notifications) will
> be forwarded to the physical BAR which its properties were supplied by
> the admin command VIRTIO_ADMIN_CMD_LEGACY_NOTIFY_INFO upon the
> probing/init flow.
> 
> With that code in place a legacy driver in the guest has the look and
> feel as if having a transitional device with legacy support for both its
> control and data path flows.
> 
> [1]
> https://github.com/oasis-tcs/virtio-spec/commit/03c2d32e5093ca9f2a17797242fbef88efe94b8c
> 
> Signed-off-by: Yishai Hadas 
> ---
>  MAINTAINERS  |   7 +
>  drivers/vfio/pci/Kconfig |   2 +
>  drivers/vfio/pci/Makefile|   2 +
>  drivers/vfio/pci/virtio/Kconfig  |  15 +
>  drivers/vfio/pci/virtio/Makefile |   4 +
>  drivers/vfio/pci/virtio/main.c   | 577 +++
>  6 files changed, 607 insertions(+)
>  create mode 100644 drivers/vfio/pci/virtio/Kconfig
>  create mode 100644 drivers/vfio/pci/virtio/Makefile
>  create mode 100644 drivers/vfio/pci/virtio/main.c
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 7a7bd8bd80e9..680a70063775 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -22620,6 +22620,13 @@ L:   k...@vger.kernel.org
>  S:   Maintained
>  F:   drivers/vfio/pci/mlx5/
>  
> +VFIO VIRTIO PCI DRIVER
> +M:   Yishai Hadas 
> +L:   k...@vger.kernel.org
> +L:   virtualization@lists.linux-foundation.org
> +S:   Maintained
> +F:   drivers/vfio/pci/virtio
> +
>  VFIO PCI DEVICE SPECIFIC DRIVERS
>  R:   Jason Gunthorpe 
>  R:   Yishai Hadas 
> diff --git a/drivers/vfio/pci/Kconfig b/drivers/vfio/pci/Kconfig
> index 8125e5f37832..18c397df566d 100644
> --- a/drivers/vfio/pci/Kconfig
> +++ b/drivers/vfio/pci/Kconfig
> @@ -65,4 +65,6 @@ source "drivers/vfio/pci/hisilicon/Kconfig"
>  
>  source "drivers/vfio/pci/pds/Kconfig"
>  
> +source "drivers/vfio/pci/virtio/Kconfig"
> +
>  endmenu
> diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
> index 45167be462d8..046139a4eca5 100644
> --- a/drivers/vfio/pci/Makefile
> +++ b/drivers/vfio/pci/Makefile
> @@ -13,3 +13,5 @@ obj-$(CONFIG_MLX5_VFIO_PCI)   += mlx5/
>  obj-$(CONFIG_HISI_ACC_VFIO_PCI) += hisilicon/
>  
>  obj-$(CONFIG_PDS_VFIO_PCI) += pds/
> +
> +obj-$(CONFIG_VIRTIO_VFIO_PCI) += virtio/
> diff 

Re: [PATCH v4 0/7] vdpa: decouple reset of iotlb mapping from device reset

2023-10-24 Thread Si-Wei Liu
Thanks a lot for testing! Please be aware that there's a follow-up fix 
for a potential oops in this v4 series:


https://lore.kernel.org/virtualization/1698102863-21122-1-git-send-email-si-wei@oracle.com/

Would be nice to have it applied for any tests.

Thanks,
-Siwei

On 10/23/2023 11:51 PM, Lei Yang wrote:

QE tested this series v4 with regression testing on a real nic; there is
no new regression bug.

Tested-by: Lei Yang 

On Tue, Oct 24, 2023 at 6:02 AM Si-Wei Liu  wrote:



On 10/22/2023 8:51 PM, Jason Wang wrote:

Hi Si-Wei:

On Sat, Oct 21, 2023 at 5:28 PM Si-Wei Liu  wrote:

In order to reduce needlessly high setup and teardown cost
of iotlb mapping during live migration, it's crucial to
decouple the vhost-vdpa iotlb abstraction from the virtio
device life cycle, i.e. iotlb mappings should be left
intact across virtio device reset [1]. For it to work, the
on-chip IOMMU parent device could implement a separate
.reset_map() operation callback to restore 1:1 DMA mapping
without having to resort to the .reset() callback, the
latter of which is mainly used to reset virtio device state.
This new .reset_map() callback will be invoked only before
the vhost-vdpa driver is to be removed and detached from
the vdpa bus, such that other vdpa bus drivers, e.g.
virtio-vdpa, can start with 1:1 DMA mapping when they
are attached. For the context, those on-chip IOMMU parent
devices, create the 1:1 DMA mapping at vdpa device creation,
and they would implicitly destroy the 1:1 mapping when
the first .set_map or .dma_map callback is invoked.

This patchset is rebased on top of the latest vhost tree.

[1] Reducing vdpa migration downtime because of memory pin / maps
https://www.mail-archive.com/qemu-devel@nongnu.org/msg953755.html

---
v4:
- Rework compatibility using new .compat_reset driver op

I still think having a set_backend_feature()

This will overload backend features with the role of carrying over
compatibility quirks, which I tried to avoid from. While I think the
.compat_reset from the v4 code just works with the backend features
acknowledgement (and maybe others as well) to determine, but not
directly tie it to backend features itself. These two have different
implications in terms of requirement, scope and maintaining/deprecation,
better to cope with compat quirks in explicit and driver visible way.


   or reset_map(clean=true) might be better.

An explicit op might be marginally better in driver writer's point of
view. Compliant driver doesn't have to bother asserting clean_map never
be true so their code would never bother dealing with this case, as
explained in the commit log for patch 5 "vhost-vdpa: clean iotlb map
during reset for older userspace":

"
  The separation of .compat_reset from the regular .reset allows
  vhost-vdpa able to know which driver had broken behavior before, so it
  can apply the corresponding compatibility quirk to the individual
driver
  whenever needed.  Compared to overloading the existing .reset with
  flags, .compat_reset won't cause any extra burden to the implementation
  of every compliant driver.
"


   As it tries hard to not introduce new stuff on the bus.

Honestly I don't see substantial difference between these other than the
color. There's no single best solution that stands out among the 3. And
I assume you already noticed that all of the above 3 approaches will
have to go with backend features negotiation, i.e. the 1st vdpa reset
before backend feature negotiation will use the compliant version of
.reset that doesn't clean up the map. I don't think this nuance
matters much to existing older userspace apps, as the maps should
already get cleaned by the previous process in vhost_vdpa_cleanup(); but if
bug-for-bug behavioral compatibility is what you want, a module parameter
will be the single best answer.

Regards,
-Siwei


But we can listen to others for sure.

Thanks




Re: [PATCH v4 5/7] vhost-vdpa: clean iotlb map during reset for older userspace

2023-10-24 Thread Si-Wei Liu



On 10/24/2023 9:21 AM, Si-Wei Liu wrote:



On 10/23/2023 10:45 PM, Jason Wang wrote:
On Sat, Oct 21, 2023 at 5:28 PM Si-Wei Liu  
wrote:

Using .compat_reset op from the previous patch, the buggy .reset
behaviour can be kept as-is on older userspace apps, which don't ack 
the
IOTLB_PERSIST backend feature. As this compatibility quirk is 
limited to

those drivers that used to be buggy in the past, it won't affect change
the behaviour or affect ABI on the setups with API compliant driver.

The separation of .compat_reset from the regular .reset allows
vhost-vdpa able to know which driver had broken behaviour before, so it
can apply the corresponding compatibility quirk to the individual 
driver

whenever needed.  Compared to overloading the existing .reset with
flags, .compat_reset won't cause any extra burden to the implementation
of every compliant driver.

Signed-off-by: Si-Wei Liu 
---
  drivers/vhost/vdpa.c | 17 +
  drivers/virtio/virtio_vdpa.c |  2 +-
  include/linux/vdpa.h |  7 +--
  3 files changed, 19 insertions(+), 7 deletions(-)

diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
index acc7c74ba7d6..9ce40003793b 100644
--- a/drivers/vhost/vdpa.c
+++ b/drivers/vhost/vdpa.c
@@ -227,13 +227,22 @@ static void vhost_vdpa_unsetup_vq_irq(struct 
vhost_vdpa *v, u16 qid)

irq_bypass_unregister_producer(&vq->call_ctx.producer);
  }

-static int vhost_vdpa_reset(struct vhost_vdpa *v)
+static int _compat_vdpa_reset(struct vhost_vdpa *v)
  {
 struct vdpa_device *vdpa = v->vdpa;
+   u32 flags = 0;

-   v->in_batch = 0;
+   flags |= !vhost_backend_has_feature(v->vdev.vqs[0],
+ VHOST_BACKEND_F_IOTLB_PERSIST) ?
+    VDPA_RESET_F_CLEAN_MAP : 0;
+
+   return vdpa_reset(vdpa, flags);
+}

-   return vdpa_reset(vdpa);
+static int vhost_vdpa_reset(struct vhost_vdpa *v)
+{
+   v->in_batch = 0;
+   return _compat_vdpa_reset(v);
  }

  static long vhost_vdpa_bind_mm(struct vhost_vdpa *v)
@@ -312,7 +321,7 @@ static long vhost_vdpa_set_status(struct 
vhost_vdpa *v, u8 __user *statusp)

 vhost_vdpa_unsetup_vq_irq(v, i);

 if (status == 0) {
-   ret = vdpa_reset(vdpa);
+   ret = _compat_vdpa_reset(v);
 if (ret)
 return ret;
 } else
diff --git a/drivers/virtio/virtio_vdpa.c 
b/drivers/virtio/virtio_vdpa.c

index 06ce6d8c2e00..8d63e5923d24 100644
--- a/drivers/virtio/virtio_vdpa.c
+++ b/drivers/virtio/virtio_vdpa.c
@@ -100,7 +100,7 @@ static void virtio_vdpa_reset(struct 
virtio_device *vdev)

  {
 struct vdpa_device *vdpa = vd_get_vdpa(vdev);

-   vdpa_reset(vdpa);
+   vdpa_reset(vdpa, 0);
  }

  static bool virtio_vdpa_notify(struct virtqueue *vq)
diff --git a/include/linux/vdpa.h b/include/linux/vdpa.h
index 6b8cbf75712d..db15ac07f8a6 100644
--- a/include/linux/vdpa.h
+++ b/include/linux/vdpa.h
@@ -519,14 +519,17 @@ static inline struct device 
*vdpa_get_dma_dev(struct vdpa_device *vdev)

 return vdev->dma_dev;
  }

-static inline int vdpa_reset(struct vdpa_device *vdev)
+static inline int vdpa_reset(struct vdpa_device *vdev, u32 flags)
  {
 const struct vdpa_config_ops *ops = vdev->config;
 int ret;

 down_write(&vdev->cf_lock);
 vdev->features_valid = false;
-   ret = ops->reset(vdev);
+   if (ops->compat_reset && flags)
+   ret = ops->compat_reset(vdev, flags);
+   else
+   ret = ops->reset(vdev);

Instead of inventing a new API that carries the flags. Tweak the
existing one seems to be simpler and better?
Well, as indicated in the commit message, this allows vhost-vdpa to
know which driver had broken behavior before, so it
can apply the corresponding compatibility quirk to the individual 
driver when it's really necessary. If sending all flags 
unconditionally down to every driver, it's hard for driver writers to 
distinguish which are compatibility quirks that they can safely ignore 
and which are feature flags that are encouraged to implement. In that 
sense, gating features from being polluted by compatibility quirks 
with an implicit op 

s/implicit/explicit/

would be better.

Regards,
-Siwei


As compat_reset(vdev, 0) == reset(vdev)

Then you don't need the switch in the parent as well

+static int vdpasim_reset(struct vdpa_device *vdpa)
+{
+   return vdpasim_compat_reset(vdpa, 0);
+}

Thanks



up_write(&vdev->cf_lock);
 return ret;
  }
--
2.39.3






Re: [PATCH v4 5/7] vhost-vdpa: clean iotlb map during reset for older userspace

2023-10-24 Thread Si-Wei Liu



On 10/23/2023 10:45 PM, Jason Wang wrote:

On Sat, Oct 21, 2023 at 5:28 PM Si-Wei Liu  wrote:

Using .compat_reset op from the previous patch, the buggy .reset
behaviour can be kept as-is on older userspace apps, which don't ack the
IOTLB_PERSIST backend feature. As this compatibility quirk is limited to
those drivers that used to be buggy in the past, it won't affect change
the behaviour or affect ABI on the setups with API compliant driver.

The separation of .compat_reset from the regular .reset allows
vhost-vdpa able to know which driver had broken behaviour before, so it
can apply the corresponding compatibility quirk to the individual driver
whenever needed.  Compared to overloading the existing .reset with
flags, .compat_reset won't cause any extra burden to the implementation
of every compliant driver.

Signed-off-by: Si-Wei Liu 
---
  drivers/vhost/vdpa.c | 17 +
  drivers/virtio/virtio_vdpa.c |  2 +-
  include/linux/vdpa.h |  7 +--
  3 files changed, 19 insertions(+), 7 deletions(-)

diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
index acc7c74ba7d6..9ce40003793b 100644
--- a/drivers/vhost/vdpa.c
+++ b/drivers/vhost/vdpa.c
@@ -227,13 +227,22 @@ static void vhost_vdpa_unsetup_vq_irq(struct vhost_vdpa 
*v, u16 qid)
 irq_bypass_unregister_producer(&vq->call_ctx.producer);
  }

-static int vhost_vdpa_reset(struct vhost_vdpa *v)
+static int _compat_vdpa_reset(struct vhost_vdpa *v)
  {
 struct vdpa_device *vdpa = v->vdpa;
+   u32 flags = 0;

-   v->in_batch = 0;
+   flags |= !vhost_backend_has_feature(v->vdev.vqs[0],
+   VHOST_BACKEND_F_IOTLB_PERSIST) ?
+VDPA_RESET_F_CLEAN_MAP : 0;
+
+   return vdpa_reset(vdpa, flags);
+}

-   return vdpa_reset(vdpa);
+static int vhost_vdpa_reset(struct vhost_vdpa *v)
+{
+   v->in_batch = 0;
+   return _compat_vdpa_reset(v);
  }

  static long vhost_vdpa_bind_mm(struct vhost_vdpa *v)
@@ -312,7 +321,7 @@ static long vhost_vdpa_set_status(struct vhost_vdpa *v, u8 
__user *statusp)
 vhost_vdpa_unsetup_vq_irq(v, i);

 if (status == 0) {
-   ret = vdpa_reset(vdpa);
+   ret = _compat_vdpa_reset(v);
 if (ret)
 return ret;
 } else
diff --git a/drivers/virtio/virtio_vdpa.c b/drivers/virtio/virtio_vdpa.c
index 06ce6d8c2e00..8d63e5923d24 100644
--- a/drivers/virtio/virtio_vdpa.c
+++ b/drivers/virtio/virtio_vdpa.c
@@ -100,7 +100,7 @@ static void virtio_vdpa_reset(struct virtio_device *vdev)
  {
 struct vdpa_device *vdpa = vd_get_vdpa(vdev);

-   vdpa_reset(vdpa);
+   vdpa_reset(vdpa, 0);
  }

  static bool virtio_vdpa_notify(struct virtqueue *vq)
diff --git a/include/linux/vdpa.h b/include/linux/vdpa.h
index 6b8cbf75712d..db15ac07f8a6 100644
--- a/include/linux/vdpa.h
+++ b/include/linux/vdpa.h
@@ -519,14 +519,17 @@ static inline struct device *vdpa_get_dma_dev(struct 
vdpa_device *vdev)
 return vdev->dma_dev;
  }

-static inline int vdpa_reset(struct vdpa_device *vdev)
+static inline int vdpa_reset(struct vdpa_device *vdev, u32 flags)
  {
 const struct vdpa_config_ops *ops = vdev->config;
 int ret;

 down_write(&vdev->cf_lock);
 vdev->features_valid = false;
-   ret = ops->reset(vdev);
+   if (ops->compat_reset && flags)
+   ret = ops->compat_reset(vdev, flags);
+   else
+   ret = ops->reset(vdev);

Instead of inventing a new API that carries the flags. Tweak the
existing one seems to be simpler and better?
Well, as indicated in the commit message, this allows vhost-vdpa to know
which driver had broken behavior before, so it
can apply the corresponding compatibility quirk to the individual driver 
when it's really necessary. If sending all flags unconditionally down to 
every driver, it's hard for driver writers to distinguish which are 
compatibility quirks that they can safely ignore and which are feature 
flags that are encouraged to implement. In that sense, gating features 
from being polluted by compatibility quirks with an implicit op would be 
better.
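
To make the intended split concrete, here is a rough sketch (not taken
from this series; the "foo" parent driver and its foo_do_reset() helper
are made up): a driver that needs the quirk wires up both ops, while a
compliant driver simply leaves .compat_reset unset and is never handed
any quirk flag.

static int foo_reset(struct vdpa_device *vdpa)
{
        /* plain, spec-compliant reset */
        return foo_do_reset(vdpa, false);
}

static int foo_compat_reset(struct vdpa_device *vdpa, u32 flags)
{
        /* same reset, plus the legacy map-clearing quirk when requested */
        return foo_do_reset(vdpa, flags & VDPA_RESET_F_CLEAN_MAP);
}

static const struct vdpa_config_ops foo_config_ops = {
        /* ... */
        .reset          = foo_reset,
        .compat_reset   = foo_compat_reset,
};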


Regards,
-Siwei


As compat_reset(vdev, 0) == reset(vdev)

Then you don't need the switch in the parent as well

+static int vdpasim_reset(struct vdpa_device *vdpa)
+{
+   return vdpasim_compat_reset(vdpa, 0);
+}

Thanks



 up_write(&vdev->cf_lock);
 return ret;
  }
--
2.39.3




Re: [PATCH v4 4/4] vduse: Add LSM hooks to check Virtio device type

2023-10-24 Thread Casey Schaufler
On 10/24/2023 2:49 AM, Maxime Coquelin wrote:
>
>
> On 10/23/23 17:13, Casey Schaufler wrote:
>> On 10/23/2023 12:28 AM, Maxime Coquelin wrote:
>>>
>>>
>>> On 10/21/23 00:20, Casey Schaufler wrote:
 On 10/20/2023 8:58 AM, Maxime Coquelin wrote:
> This patch introduces LSM hooks for device creation,
> destruction and opening operations, checking that the
> application is allowed to perform these operations for
> the Virtio device type.

 Why do you think that there needs to be a special LSM check for virtio
 devices? Why can't existing device attributes be used?
>>>
>>> Michael asked for a way for SELinux to allow/prevent the creation of
>>> some types of devices [0].
>>>
>>> A device is created using ioctl() on VDUSE control chardev. Its type is
>>> specified via a field in the structure passed in argument.
>>>
>>> I didn't see any other way than adding dedicated LSM hooks to achieve this,
>>> but it is possible that there is a better way to do it?
>>
>> At the very least the hook should be made more general, and I'd have to
>> see a proposal before commenting on that. security_dev_destroy(dev)
>> might
>> be a better approach. If there's reason to control destruction of vduse
>> devices it's reasonable to assume that there are other devices with the
>> same or similar properties.
>
> VDUSE is different from other devices as the device is actually
> implemented by the user-space application, so this is very specific in
> my opinion.

This is hardly unique. If you're implementing the device
in user-space you may well be able to implement the desired
controls there.

>
>>
>> Since SELinux is your target use case, can you explain why you can't
>> create SELinux policy to enforce the restrictions you're after? I
>> believe
>> (but can be proven wrong, of course) that SELinux has a mechanism for
>> dealing
>> with controls on ioctls.
>>
>
> I am not aware of such a mechanism to deal with ioctl(); if you have a
> pointer, that would be welcome.

security/selinux/hooks.c
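
For a rough idea of the shape, below is a minimal sketch of an LSM
file_ioctl hook gating the VDUSE control-device ioctls. The prototype
matches the existing file_ioctl LSM hook; the function names are made
up, the policy check is hypothetical, and hook registration is omitted.

#include <linux/fs.h>
#include <linux/lsm_hooks.h>
#include <uapi/linux/vduse.h>

static int example_file_ioctl(struct file *file, unsigned int cmd,
                              unsigned long arg)
{
        /* Gate create/destroy requests on the control chardev. Note the
         * virtio device type still sits behind the user pointer in
         * 'arg', which is the gap the dedicated hooks aim to close. */
        switch (cmd) {
        case VDUSE_CREATE_DEV:
        case VDUSE_DESTROY_DEV:
                return example_policy_check(file, cmd); /* hypothetical */
        default:
                return 0;
        }
}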

>
> Thanks,
> Maxime
>
>>
>>>
>>> Thanks,
>>> Maxime
>>>
>>> [0]:
>>> https://lore.kernel.org/all/20230829130430-mutt-send-email-...@kernel.org/
>>>
>>>
>>
>


Re: [PATCH 1/2] virtio_pci: Don't make an extra copy of cpu affinity mask

2023-10-24 Thread Jakub Sitnicki via Virtualization
On Tue, Oct 24, 2023 at 07:46 PM +08, Xuan Zhuo wrote:
> On Tue, 24 Oct 2023 13:26:49 +0200, Jakub Sitnicki  
> wrote:
>> On Tue, Oct 24, 2023 at 06:53 PM +08, Xuan Zhuo wrote:
>> > On Tue, 24 Oct 2023 10:17:19 +0200, Jakub Sitnicki  
>> > wrote:
>> >> On Tue, Oct 24, 2023 at 10:31 AM +08, Xuan Zhuo wrote:
>> >> > On Mon, 23 Oct 2023 18:52:45 +0200, Jakub Sitnicki 
>> >> >  wrote:
>> >> >> On Thu, Oct 19, 2023 at 08:55 PM +08, Xuan Zhuo wrote:
>> >> >> > On Thu, 19 Oct 2023 12:16:24 +0200, Jakub Sitnicki 
>> >> >> >  wrote:
>> >> >> >> Since commit 19e226e8cc5d ("virtio: Make vp_set_vq_affinity() take a
>> >> >> >> mask.") it is actually not needed to have a local copy of the cpu 
>> >> >> >> mask.
>> >> >> >
>> >> >> >
>> >> >> > Could you give more info to prove this?
>> >> >
>> >> >
>> >> > Actually, my question is that can we pass a val on the stack(or temp 
>> >> > value) to
>> >> > the irq_set_affinity_hint()?
>> >> >
>> >> > Such as the virtio-net uses zalloc_cpumask_var to alloc a cpu_mask, and
>> >> > that will be released.
>> >> >
>> >> >
>> >> >
>> >> > int __irq_apply_affinity_hint(unsigned int irq, const struct 
>> >> > cpumask *m,
>> >> >   bool setaffinity)
>> >> > {
>> >> > unsigned long flags;
>> >> > struct irq_desc *desc = irq_get_desc_lock(irq, &flags, 
>> >> > IRQ_GET_DESC_CHECK_GLOBAL);
>> >> >
>> >> > if (!desc)
>> >> > return -EINVAL;
>> >> > ->  desc->affinity_hint = m;
>> >> > irq_put_desc_unlock(desc, flags);
>> >> > if (m && setaffinity)
>> >> > __irq_set_affinity(irq, m, false);
>> >> > return 0;
>> >> > }
>> >> > EXPORT_SYMBOL_GPL(__irq_apply_affinity_hint);
>> >> >
>> >> > The above code directly refers to the mask pointer. If the mask is a temp 
>> >> > value, I
>> >> > think that is a bug.
>> >>
>> >> You are completely right. irq_set_affinity_hint stores the mask pointer.
>> >> irq_affinity_hint_proc_show later dereferences it when user reads out
>> >> /proc/irq/*/affinity_hint.
>> >>
>> >> I have failed to notice that. That's why we need cpumask_copy to stay.
>> >>
>> >> My patch is buggy. Please disregard.
>> >>
>> >> I will send a v2 to only migrate from deprecated irq_set_affinity_hint.
>> >>
>> >> > And I notice that many places directly pass the temp value to this API.
>> >> > And I am a little confused. ^_^ Or I missed something.
>> >>
>> >> There seem to be two groups of callers:
>> >>
>> >> 1. Those that use get_cpu_mask/cpumask_of/cpumask_of_node to produce a
>> >>cpumask pointer which is a preallocated constant.
>> >>
>> >>$ weggli 'irq_set_affinity_hint(_, $func(_));' ~/src/linux
>> >>
>> >> 2. Those that pass a pointer to memory somewhere.
>> >>
>> >>$ weggli 'irq_set_affinity_hint(_, $mask);' ~/src/linux
>> >>
>> >> (weggli tool can be found at https://github.com/weggli-rs/weggli)
>> >>
>> >> I've looked over the callers from group #2 but I couldn't find any
>> >> passing a pointer to memory on the stack :-)
>> >
>> > Pls check stmmac_request_irq_multi_msi()
>>
>> Good catch. That one looks buggy.
>>
>> I should have also checked for callers that take the address of a var/field:
>>
>>   $ weggli 'irq_set_affinity_hint(_, &$mask);' ~/src/linux
>
> Do you find more?

No, just the one you pointed out. Unless I missed something.

I ran an improved query. It shows everything but the non-interesting cases:

$ weggli '{
NOT: irq_set_affinity_hint(_, NULL);
NOT: irq_set_affinity_hint(_, get_cpu_mask(_));
NOT: irq_set_affinity_hint(_, cpumask_of(_));
irq_set_affinity_hint(_, _);
}' ~/src/linux

And repeated it for irq_set_affinity_and_hint and irq_update_affinity.

The calls where it was not obvious at first sight that we're passing a
pointer to some heap memory turned out to use a temporary variable to
store either an address of heap memory or the return value of cpumask_of*().
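
For reference, the safe pattern implied throughout this thread (keep the
hinted mask allocated for as long as the hint is registered, and clear
the hint before freeing) would look roughly like this; a sketch with
made-up names, not lifted from any driver:

        cpumask_var_t mask;

        if (!zalloc_cpumask_var(&mask, GFP_KERNEL))
                return -ENOMEM;
        cpumask_copy(mask, src_mask);     /* src_mask may be temporary */
        irq_set_affinity_hint(irq, mask); /* stores the pointer, no copy */

        /* ... mask must stay allocated while the hint is in place ... */

        irq_set_affinity_hint(irq, NULL); /* drop the stored pointer first */
        free_cpumask_var(mask);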



Re: [PATCH 1/2] virtio_pci: Don't make an extra copy of cpu affinity mask

2023-10-24 Thread Xuan Zhuo
On Tue, 24 Oct 2023 13:26:49 +0200, Jakub Sitnicki  wrote:
> On Tue, Oct 24, 2023 at 06:53 PM +08, Xuan Zhuo wrote:
> > On Tue, 24 Oct 2023 10:17:19 +0200, Jakub Sitnicki  
> > wrote:
> >> On Tue, Oct 24, 2023 at 10:31 AM +08, Xuan Zhuo wrote:
> >> > On Mon, 23 Oct 2023 18:52:45 +0200, Jakub Sitnicki 
> >> >  wrote:
> >> >> On Thu, Oct 19, 2023 at 08:55 PM +08, Xuan Zhuo wrote:
> >> >> > On Thu, 19 Oct 2023 12:16:24 +0200, Jakub Sitnicki 
> >> >> >  wrote:
> >> >> >> Since commit 19e226e8cc5d ("virtio: Make vp_set_vq_affinity() take a
> >> >> >> mask.") it is actually not needed to have a local copy of the cpu 
> >> >> >> mask.
> >> >> >
> >> >> >
> >> >> > Could you give more info to prove this?
> >> >
> >> >
> >> > Actually, my question is that can we pass a val on the stack(or temp 
> >> > value) to
> >> > the irq_set_affinity_hint()?
> >> >
> >> > Such as the virtio-net uses zalloc_cpumask_var to alloc a cpu_mask, and
> >> > that will be released.
> >> >
> >> >
> >> >
> >> >  int __irq_apply_affinity_hint(unsigned int irq, const struct cpumask *m,
> >> >bool setaffinity)
> >> >  {
> >> >  unsigned long flags;
> >> >  struct irq_desc *desc = irq_get_desc_lock(irq, &flags, 
> >> > IRQ_GET_DESC_CHECK_GLOBAL);
> >> >
> >> >  if (!desc)
> >> >  return -EINVAL;
> >> > ->   desc->affinity_hint = m;
> >> >  irq_put_desc_unlock(desc, flags);
> >> >  if (m && setaffinity)
> >> >  __irq_set_affinity(irq, m, false);
> >> >  return 0;
> >> >  }
> >> >  EXPORT_SYMBOL_GPL(__irq_apply_affinity_hint);
> >> >
> >> > The above code directly refers to the mask pointer. If the mask is a temp 
> >> > value, I
> >> > think that is a bug.
> >>
> >> You are completely right. irq_set_affinity_hint stores the mask pointer.
> >> irq_affinity_hint_proc_show later dereferences it when user reads out
> >> /proc/irq/*/affinity_hint.
> >>
> >> I have failed to notice that. That's why we need cpumask_copy to stay.
> >>
> >> My patch is buggy. Please disregard.
> >>
> >> I will send a v2 to only migrate from deprecated irq_set_affinity_hint.
> >>
> >> > And I notice that many places directly pass the temp value to this API.
> >> > And I am a little confused. ^_^ Or I missed something.
> >>
> >> There seem to be two groups of callers:
> >>
> >> 1. Those that use get_cpu_mask/cpumask_of/cpumask_of_node to produce a
> >>cpumask pointer which is a preallocated constant.
> >>
> >>$ weggli 'irq_set_affinity_hint(_, $func(_));' ~/src/linux
> >>
> >> 2. Those that pass a pointer to memory somewhere.
> >>
> >>$ weggli 'irq_set_affinity_hint(_, $mask);' ~/src/linux
> >>
> >> (weggli tool can be found at https://github.com/weggli-rs/weggli)
> >>
> >> I've looked over the callers from group #2 but I couldn't find any
> >> passing a pointer to memory on the stack :-)
> >
> > Pls check stmmac_request_irq_multi_msi()
>
> Good catch. That one looks buggy.
>
> I should have also checked for callers that take the address of a var/field:
>
>   $ weggli 'irq_set_affinity_hint(_, &$mask);' ~/src/linux

Do you find more?

Thanks.


Re: [PATCH 1/2] virtio_pci: Don't make an extra copy of cpu affinity mask

2023-10-24 Thread Jakub Sitnicki via Virtualization
On Tue, Oct 24, 2023 at 06:53 PM +08, Xuan Zhuo wrote:
> On Tue, 24 Oct 2023 10:17:19 +0200, Jakub Sitnicki  
> wrote:
>> On Tue, Oct 24, 2023 at 10:31 AM +08, Xuan Zhuo wrote:
>> > On Mon, 23 Oct 2023 18:52:45 +0200, Jakub Sitnicki  
>> > wrote:
>> >> On Thu, Oct 19, 2023 at 08:55 PM +08, Xuan Zhuo wrote:
>> >> > On Thu, 19 Oct 2023 12:16:24 +0200, Jakub Sitnicki 
>> >> >  wrote:
>> >> >> Since commit 19e226e8cc5d ("virtio: Make vp_set_vq_affinity() take a
>> >> >> mask.") it is actually not needed to have a local copy of the cpu mask.
>> >> >
>> >> >
>> >> > Could you give more info to prove this?
>> >
>> >
>> > Actually, my question is that can we pass a val on the stack(or temp 
>> > value) to
>> > the irq_set_affinity_hint()?
>> >
>> > Such as the virtio-net uses zalloc_cpumask_var to alloc a cpu_mask, and
>> > that will be released.
>> >
>> >
>> >
>> >int __irq_apply_affinity_hint(unsigned int irq, const struct cpumask *m,
>> >  bool setaffinity)
>> >{
>> >unsigned long flags;
>> >struct irq_desc *desc = irq_get_desc_lock(irq, &flags, 
>> > IRQ_GET_DESC_CHECK_GLOBAL);
>> >
>> >if (!desc)
>> >return -EINVAL;
>> > -> desc->affinity_hint = m;
>> >irq_put_desc_unlock(desc, flags);
>> >if (m && setaffinity)
>> >__irq_set_affinity(irq, m, false);
>> >return 0;
>> >}
>> >EXPORT_SYMBOL_GPL(__irq_apply_affinity_hint);
>> >
>> > The above code directly refers to the mask pointer. If the mask is a temp 
>> > value, I
>> > think that is a bug.
>>
>> You are completely right. irq_set_affinity_hint stores the mask pointer.
>> irq_affinity_hint_proc_show later dereferences it when user reads out
>> /proc/irq/*/affinity_hint.
>>
>> I have failed to notice that. That's why we need cpumask_copy to stay.
>>
>> My patch is buggy. Please disregard.
>>
>> I will send a v2 to only migrate from deprecated irq_set_affinity_hint.
>>
>> > And I notice that many places directly pass the temp value to this API.
>> > And I am a little confused. ^_^ Or I missed something.
>>
>> There seem to be two groups of callers:
>>
>> 1. Those that use get_cpu_mask/cpumask_of/cpumask_of_node to produce a
>>cpumask pointer which is a preallocated constant.
>>
>>$ weggli 'irq_set_affinity_hint(_, $func(_));' ~/src/linux
>>
>> 2. Those that pass a pointer to memory somewhere.
>>
>>$ weggli 'irq_set_affinity_hint(_, $mask);' ~/src/linux
>>
>> (weggli tool can be found at https://github.com/weggli-rs/weggli)
>>
>> I've looked over the callers from group #2 but I couldn't find any
>> passing a pointer to memory on the stack :-)
>
> Pls check stmmac_request_irq_multi_msi()

Good catch. That one looks buggy.

I should have also checked for callers that take the address of a var/field:

  $ weggli 'irq_set_affinity_hint(_, &$mask);' ~/src/linux


Re: [PATCH v3] ALSA: virtio: use ack callback

2023-10-24 Thread Matias Ezequiel Vara Larsen
On Tue, Oct 24, 2023 at 09:11:10AM +0900, Anton Yakovlev wrote:
> Hi Matias,
> 
> Thanks for the new patch!
> 
> 
> 
> On 24.10.2023 00:06, Matias Ezequiel Vara Larsen wrote:
> > This commit uses the ack() callback to determine when a buffer has been
> > updated, then exposes it to the device.
> > 
> > The current mechanism splits a dma buffer into descriptors that are
> > exposed to the device. This dma buffer is shared with the user
> > application. When the device consumes a buffer, the driver moves the
> > request from the used ring to the available ring.
> > 
> > The driver exposes the buffer to the device without knowing if the
> > content has been updated by the user. Section 2.8.21.1 of the
> > virtio spec states that: "The device MAY access the descriptor chains
> > the driver created and the memory they refer to immediately". If the
> > device picks up buffers from the available ring just after it is
> > notified, the content may be stale.
> > 
> > When the ack() callback is invoked, the driver exposes only the buffers
> > that have already been updated, i.e., enqueued in the available ring.
> > Thus, the device always picks up a buffer that is up to date.
> > 
> > For capturing, the driver starts by exposing all the available buffers
> > to the device. After the device updates the content of a buffer, it
> > enqueues it in the used ring. It is only after the ack() for capturing
> > is issued that the driver re-enqueues the buffer in the available ring.
> > 
> > Co-developed-by: Anton Yakovlev 
> > Signed-off-by: Anton Yakovlev 
> > Signed-off-by: Matias Ezequiel Vara Larsen 
> > ---
> > Changelog:
> > v2 -> v3:
> >   * Use ack() callback instead of copy()/fill_silence()
> >   * Change commit name
> >   * Rewrite commit message
> >   * Remove virtsnd_pcm_msg_send_locked()
> >   * Use single callback for both capture and playback
> >   * Fix kernel test robot warnings regarding documentation
> >   * v2 patch at:
> > 
> > https://lore.kernel.org/lkml/87y1fzkq8c.wl-tiwai@suse.de/T/
> > v1 -> v2:
> >   * Use snd_pcm_set_managed_buffer_all()for buffer allocation/freeing.
> >   * Make virtsnd_pcm_msg_send() generic by specifying the offset and size
> > for the modified part of the buffer; this way no assumptions need to
> > be made.
> >   * Disable SNDRV_PCM_INFO_NO_REWINDS since now only sequential
> > reading/writing of frames is supported.
> >   * Correct comment at virtsnd_pcm_msg_send().
> >   * v1 patch at:
> > 
> > https://lore.kernel.org/lkml/20231016151000.GE119987@fedora/t/
> > 
> >   sound/virtio/virtio_pcm.c |  1 +
> >   sound/virtio/virtio_pcm.h |  6 ++-
> >   sound/virtio/virtio_pcm_msg.c | 80 ---
> >   sound/virtio/virtio_pcm_ops.c | 30 +++--
> >   4 files changed, 79 insertions(+), 38 deletions(-)
> > 
> > diff --git a/sound/virtio/virtio_pcm.c b/sound/virtio/virtio_pcm.c
> > index c10d91fff2fb..9cc5a95b4913 100644
> > --- a/sound/virtio/virtio_pcm.c
> > +++ b/sound/virtio/virtio_pcm.c
> > @@ -124,6 +124,7 @@ static int virtsnd_pcm_build_hw(struct 
> > virtio_pcm_substream *vss,
> > values = le64_to_cpu(info->formats);
> > vss->hw.formats = 0;
> > +   vss->appl_ptr = 0;
> > for (i = 0; i < ARRAY_SIZE(g_v2a_format_map); ++i)
> > if (values & (1ULL << i)) {
> > diff --git a/sound/virtio/virtio_pcm.h b/sound/virtio/virtio_pcm.h
> > index 062eb8e8f2cf..ea3c2845ae9b 100644
> > --- a/sound/virtio/virtio_pcm.h
> > +++ b/sound/virtio/virtio_pcm.h
> > @@ -27,6 +27,7 @@ struct virtio_pcm_msg;
> >*substream operators.
> >* @buffer_bytes: Current buffer size in bytes.
> >* @hw_ptr: Substream hardware pointer value in bytes [0 ... 
> > buffer_bytes).
> > + * @appl_ptr: Substream application pointer value in bytes [0 ... 
> > buffer_bytes).
> >* @xfer_enabled: Data transfer state (0 - off, 1 - on).
> >* @xfer_xrun: Data underflow/overflow state (0 - no xrun, 1 - xrun).
> >* @stopped: True if the substream is stopped and must be released on the 
> > device
> > @@ -51,13 +52,13 @@ struct virtio_pcm_substream {
> > spinlock_t lock;
> > size_t buffer_bytes;
> > size_t hw_ptr;
> > +   size_t appl_ptr;
> > bool xfer_enabled;
> > bool xfer_xrun;
> > bool stopped;
> > bool suspended;
> > struct virtio_pcm_msg **msgs;
> > unsigned int nmsgs;
> > -   int msg_last_enqueued;
> > unsigned int msg_count;
> > wait_queue_head_t msg_empty;
> >   };
> > @@ -117,7 +118,8 @@ int virtsnd_pcm_msg_alloc(struct virtio_pcm_substream 
> > 

Re: [PATCH v3] ALSA: virtio: use ack callback

2023-10-24 Thread Matias Ezequiel Vara Larsen
On Mon, Oct 23, 2023 at 05:50:00PM +0200, Takashi Iwai wrote:
> On Mon, 23 Oct 2023 17:06:57 +0200,
> Matias Ezequiel Vara Larsen wrote:
> > 
> > +static int virtsnd_pcm_ack(struct snd_pcm_substream *substream)
> > +{
> > +   struct virtio_pcm_substream *vss = snd_pcm_substream_chip(substream);
> > +   struct virtio_snd_queue *queue = virtsnd_pcm_queue(vss);
> > +   unsigned long flags;
> > +   struct snd_pcm_runtime *runtime = vss->substream->runtime;
> > +   ssize_t appl_pos = frames_to_bytes(runtime, runtime->control->appl_ptr);
> > +   ssize_t buf_size = frames_to_bytes(runtime, runtime->buffer_size);
> > +   int rc;
> > +
> > +   spin_lock_irqsave(&queue->lock, flags);
> > +   spin_lock(&vss->lock);
> > +
> > +   ssize_t bytes = (appl_pos - vss->appl_ptr) % buf_size;
> 
> The variable declaration should be moved to the beginning of the
> function.
> 
> Also, there can be an overlap beyond runtime->boundary (which easily
> happens for 32-bit apps), so the calculation can be a bit more complex,
> with a conditional.
> 

Should I use `cs46xx_playback/capture_transfer()`, which relies on
`snd_pcm_indirect_playback/capture_transfer()`, as an example? It looks
like it already does that calculation.
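
For what it's worth, the boundary-aware delta I have in mind would
follow the same idiom, roughly like this (untested sketch; 'last_appl'
stands in for the driver's cached application pointer, kept in frames
rather than bytes):

        /* delta since last ack, robust to appl_ptr wrapping at
         * runtime->boundary, as snd_pcm_indirect_playback_transfer() does */
        snd_pcm_uframes_t appl_ptr = runtime->control->appl_ptr;
        snd_pcm_sframes_t diff = appl_ptr - last_appl;

        if (diff < 0)
                diff += runtime->boundary;
        bytes = frames_to_bytes(runtime, diff);
        last_appl = appl_ptr;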

Thanks, Matias.



Re: [PATCH 1/2] virtio_pci: Don't make an extra copy of cpu affinity mask

2023-10-24 Thread Xuan Zhuo
On Tue, 24 Oct 2023 10:17:19 +0200, Jakub Sitnicki  wrote:
> On Tue, Oct 24, 2023 at 10:31 AM +08, Xuan Zhuo wrote:
> > On Mon, 23 Oct 2023 18:52:45 +0200, Jakub Sitnicki  
> > wrote:
> >> On Thu, Oct 19, 2023 at 08:55 PM +08, Xuan Zhuo wrote:
> >> > On Thu, 19 Oct 2023 12:16:24 +0200, Jakub Sitnicki 
> >> >  wrote:
> >> >> Since commit 19e226e8cc5d ("virtio: Make vp_set_vq_affinity() take a
> >> >> mask.") it is actually not needed to have a local copy of the cpu mask.
> >> >
> >> >
> >> > Could you give more info to prove this?
> >
> >
> > Actually, my question is that can we pass a val on the stack(or temp value) 
> > to
> > the irq_set_affinity_hint()?
> >
> > Such as the virtio-net uses zalloc_cpumask_var to alloc a cpu_mask, and
> > that will be released.
> >
> >
> >
> > int __irq_apply_affinity_hint(unsigned int irq, const struct cpumask *m,
> >   bool setaffinity)
> > {
> > unsigned long flags;
> > struct irq_desc *desc = irq_get_desc_lock(irq, &flags, 
> > IRQ_GET_DESC_CHECK_GLOBAL);
> >
> > if (!desc)
> > return -EINVAL;
> > ->  desc->affinity_hint = m;
> > irq_put_desc_unlock(desc, flags);
> > if (m && setaffinity)
> > __irq_set_affinity(irq, m, false);
> > return 0;
> > }
> > EXPORT_SYMBOL_GPL(__irq_apply_affinity_hint);
> >
> > The above code directly refers to the mask pointer. If the mask is a temp 
> > value, I
> > think that is a bug.
>
> You are completely right. irq_set_affinity_hint stores the mask pointer.
> irq_affinity_hint_proc_show later dereferences it when user reads out
> /proc/irq/*/affinity_hint.
>
> I have failed to notice that. That's why we need cpumask_copy to stay.
>
> My patch is buggy. Please disregard.
>
> I will send a v2 to only migrate from deprecated irq_set_affinity_hint.
>
> > And I notice that many places directly pass the temp value to this API.
> > And I am a little confused. ^_^ Or I missed something.
>
> There seem to be two groups of callers:
>
> 1. Those that use get_cpu_mask/cpumask_of/cpumask_of_node to produce a
>cpumask pointer which is a preallocated constant.
>
>$ weggli 'irq_set_affinity_hint(_, $func(_));' ~/src/linux
>
> 2. Those that pass a pointer to memory somewhere.
>
>$ weggli 'irq_set_affinity_hint(_, $mask);' ~/src/linux
>
> (weggli tool can be found at https://github.com/weggli-rs/weggli)
>
> I've looked over the callers from group #2 but I couldn't find any
> passing a pointer to memory on the stack :-)

Pls check stmmac_request_irq_multi_msi()

Thanks.


>
> Thanks for pointing this out.
>
> [...]


Re: [PATCH v4 4/4] vduse: Add LSM hooks to check Virtio device type

2023-10-24 Thread Maxime Coquelin




On 10/23/23 17:13, Casey Schaufler wrote:

On 10/23/2023 12:28 AM, Maxime Coquelin wrote:



On 10/21/23 00:20, Casey Schaufler wrote:

On 10/20/2023 8:58 AM, Maxime Coquelin wrote:

This patch introduces LSM hooks for device creation,
destruction and opening operations, checking that the
application is allowed to perform these operations for
the Virtio device type.


Why do you think that there needs to be a special LSM check for virtio
devices? Why can't existing device attributes be used?


Michael asked for a way for SELinux to allow/prevent the creation of
some types of devices [0].

A device is created using ioctl() on VDUSE control chardev. Its type is
specified via a field in the structure passed in argument.

I didn't see any other way than adding dedicated LSM hooks to achieve this,
but it is possible that there is a better way to do it?


At the very least the hook should be made more general, and I'd have to
see a proposal before commenting on that. security_dev_destroy(dev) might
be a better approach. If there's reason to control destruction of vduse
devices it's reasonable to assume that there are other devices with the
same or similar properties.


VDUSE is different from other devices as the device is actually
implemented by the user-space application, so this is very specific in
my opinion.



Since SELinux is your target use case, can you explain why you can't
create SELinux policy to enforce the restrictions you're after? I believe
(but can be proven wrong, of course) that SELinux has a mechanism for dealing
with controls on ioctls.



I am not aware of such a mechanism to deal with ioctl(); if you have a
pointer, that would be welcome.


Thanks,
Maxime





Thanks,
Maxime

[0]:
https://lore.kernel.org/all/20230829130430-mutt-send-email-...@kernel.org/







Re: [PATCH 1/2] virtio_pci: Don't make an extra copy of cpu affinity mask

2023-10-24 Thread Jakub Sitnicki via Virtualization
On Tue, Oct 24, 2023 at 10:31 AM +08, Xuan Zhuo wrote:
> On Mon, 23 Oct 2023 18:52:45 +0200, Jakub Sitnicki  
> wrote:
>> On Thu, Oct 19, 2023 at 08:55 PM +08, Xuan Zhuo wrote:
>> > On Thu, 19 Oct 2023 12:16:24 +0200, Jakub Sitnicki  
>> > wrote:
>> >> Since commit 19e226e8cc5d ("virtio: Make vp_set_vq_affinity() take a
>> >> mask.") it is actually not needed to have a local copy of the cpu mask.
>> >
>> >
>> > Could you give more info to prove this?
>
>
> Actually, my question is that can we pass a val on the stack(or temp value) to
> the irq_set_affinity_hint()?
>
> Such as the virtio-net uses zalloc_cpumask_var to alloc a cpu_mask, and
> that will be released.
>
>
>
>   int __irq_apply_affinity_hint(unsigned int irq, const struct cpumask *m,
> bool setaffinity)
>   {
>   unsigned long flags;
>   struct irq_desc *desc = irq_get_desc_lock(irq, &flags, 
> IRQ_GET_DESC_CHECK_GLOBAL);
>
>   if (!desc)
>   return -EINVAL;
> ->desc->affinity_hint = m;
>   irq_put_desc_unlock(desc, flags);
>   if (m && setaffinity)
>   __irq_set_affinity(irq, m, false);
>   return 0;
>   }
>   EXPORT_SYMBOL_GPL(__irq_apply_affinity_hint);
>
> The above code directly refers to the mask pointer. If the mask is a temp value, 
> I
> think that is a bug.

You are completely right. irq_set_affinity_hint stores the mask pointer.
irq_affinity_hint_proc_show later dereferences it when user reads out
/proc/irq/*/affinity_hint.

I have failed to notice that. That's why we need cpumask_copy to stay.

My patch is buggy. Please disregard.

I will send a v2 to only migrate from deprecated irq_set_affinity_hint.

> And I notice that many places directly pass the temp value to this API.
> And I am a little confused. ^_^ Or I missed something.

There seem to be two groups of callers:

1. Those that use get_cpu_mask/cpumask_of/cpumask_of_node to produce a
   cpumask pointer which is a preallocated constant.

   $ weggli 'irq_set_affinity_hint(_, $func(_));' ~/src/linux

2. Those that pass a pointer to memory somewhere.

   $ weggli 'irq_set_affinity_hint(_, $mask);' ~/src/linux

(weggli tool can be found at https://github.com/weggli-rs/weggli)

I've looked over the callers from group #2 but I couldn't find any
passing a pointer to memory on the stack :-)

Thanks for pointing this out.

[...]


Re: [PATCH] vhost-vdpa: fix NULL pointer deref in _compat_vdpa_reset

2023-10-24 Thread Dragos Tatulea via Virtualization
On Mon, 2023-10-23 at 16:14 -0700, Si-Wei Liu wrote:
> As the subject says: there is a vhost_vdpa_reset() done earlier, before
> vhost_dev is initialized via vhost_dev_init(), ending up with a
> NULL pointer dereference. The fix is to check whether vqs is initialized
> before checking backend features and resetting the device.
> 
>   BUG: kernel NULL pointer dereference, address: 
>   #PF: supervisor read access in kernel mode
>   #PF: error_code(0x) - not-present page
>   PGD 0 P4D 0
>   Oops:  [#1] SMP
>   CPU: 3 PID: 1727 Comm: qemu-system-x86 Not tainted 6.6.0-rc6+ #2
>   Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-
>   a4aeb02-prebuilt.qemu.org 04/01/2014
>   RIP: 0010:_compat_vdpa_reset+0x47/0xc0 [vhost_vdpa]
>   Code: c7 c7 fb 12 56 a0 4c 8d a5 b8 02 00 00 48 89 ea e8 7e b8 c4
>   48 89 ee 48 c7 c7 19 13 56 a0 4c 8b ad b0 02 00 00 <48> 8b 00 49
>   00 48 8b 80 88 45 00 00 48 c1 e8 08 48
>   RSP: 0018:8881063c3c38 EFLAGS: 00010246
>   RAX:  RBX: 8881074eb800 RCX: 
>   RDX:  RSI: 888103ab4000 RDI: a0561319
>   RBP: 888103ab4000 R08: dfff R09: 0001
>   R10: 0003 R11: 7fecbac0 R12: 888103ab42b8
>   R13: 888106dbe850 R14: 0003 R15: 8881074ebc18
>   FS:  7f02fba6ef00() GS:5f8c()
>   knlGS:
>   CS:  0010 DS:  ES:  CR0: 80050033
>   CR2:  CR3: 0001325e5003 CR4: 00372ea0
>   DR0:  DR1:  DR2: 
>   DR3:  DR6: fffe0ff0 DR7: 0400
>   Call Trace:
>    
>    ? __die+0x1f/0x60
>    ? page_fault_oops+0x14c/0x3b0
>    ? exc_page_fault+0x74/0x140
>    ? asm_exc_page_fault+0x22/0x30
>    ? _compat_vdpa_reset+0x47/0xc0 [vhost_vdpa]
>    ? _compat_vdpa_reset+0x32/0xc0 [vhost_vdpa]
>    vhost_vdpa_open+0x55/0x270 [vhost_vdpa]
>    ? sb_init_dio_done_wq+0x50/0x50
>    chrdev_open+0xc0/0x210
>    ? __unregister_chrdev+0x50/0x50
>    do_dentry_open+0x1fc/0x4f0
>    path_openat+0xc2d/0xf20
>    do_filp_open+0xb4/0x160
>    ? kmem_cache_alloc+0x3c/0x490
>    do_sys_openat2+0x8d/0xc0
>    __x64_sys_openat+0x6a/0xa0
>    do_syscall_64+0x3c/0x80
>    entry_SYSCALL_64_after_hwframe+0x46/0xb0
> 
> Fixes: 10cbf8dfaf93 ("vhost-vdpa: clean iotlb map during reset for older
> userspace")
> Reported-by: Dragos Tatulea 

This works, thanks for the quick response.

Tested-by: Dragos Tatulea 

> Closes:
> https://lore.kernel.org/all/b4913f84-8b52-4d28-af51-8573dc361...@oracle.com/
> Signed-off-by: Si-Wei Liu 
> ---
>  drivers/vhost/vdpa.c | 8 +---
>  1 file changed, 5 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
> index 9ce40003793b..9a2343c45df0 100644
> --- a/drivers/vhost/vdpa.c
> +++ b/drivers/vhost/vdpa.c
> @@ -232,9 +232,11 @@ static int _compat_vdpa_reset(struct vhost_vdpa *v)
> struct vdpa_device *vdpa = v->vdpa;
> u32 flags = 0;
>  
> -   flags |= !vhost_backend_has_feature(v->vdev.vqs[0],
> -   VHOST_BACKEND_F_IOTLB_PERSIST) ?
> -    VDPA_RESET_F_CLEAN_MAP : 0;
> +   if (v->vdev.vqs) {
> +   flags |= !vhost_backend_has_feature(v->vdev.vqs[0],
> +  
> VHOST_BACKEND_F_IOTLB_PERSIST) ?
> +    VDPA_RESET_F_CLEAN_MAP : 0;
> +   }
>  
> return vdpa_reset(vdpa, flags);
>  }


Re: [RFC] vhost: vmap virtio descriptor table kernel/kvm

2023-10-24 Thread Liang Chen
On Tue, Oct 24, 2023 at 12:45 PM Jason Wang  wrote:
>
> On Tue, Oct 24, 2023 at 11:17 AM Liang Chen  wrote:
> >
> > The current vhost code uses 'copy_from_user' to copy descriptors from
> > userspace to vhost. We attempted to 'vmap' the descriptor table to
> > reduce the overhead associated with 'copy_from_user' during descriptor
> > access, because it can be accessed quite frequently. This change
> > resulted in a moderate performance improvement (approximately 3%) in
> > our pktgen test, as shown below. Additionally, the latency in the
> > 'vhost_get_vq_desc' function shows a noticeable decrease.
> >
> > current code:
> > IFACE   rxpck/s   txpck/srxkB/stxkB/s
> > rxcmp/s   txcmp/s  rxmcst/s   %ifutil
> > Average:vnet0  0.31 1330807.03  0.02  77976.98
> > 0.00  0.00  0.00  6.39
> > # /usr/share/bcc/tools/funclatency -d 10 vhost_get_vq_desc
> > avg = 145 nsecs, total: 1455980341 nsecs, count: 9985224
> >
> > vmap:
> > IFACE   rxpck/s   txpck/srxkB/stxkB/s
> > rxcmp/s   txcmp/s  rxmcst/s   %ifutil
> > Average:vnet0  0.07 1371870.49  0.00  80383.04
> > 0.00  0.00  0.00  6.58
> > # /usr/share/bcc/tools/funclatency -d 10 vhost_get_vq_desc
> > avg = 122 nsecs, total: 1286983929 nsecs, count: 10478134
> >
> > We are uncertain if there are any aspects we may have overlooked and
> > would appreciate any advice before we submit an actual patch.
>
> So the idea is to use a shadow page table instead of the userspace one
> to avoid things like spec barriers or SMAP.
>
> I've tried this in the past:
>
> commit 7f466032dc9e5a61217f22ea34b2df932786bbfc (HEAD)
> Author: Jason Wang 
> Date:   Fri May 24 04:12:18 2019 -0400
>
> vhost: access vq metadata through kernel virtual address
>
> It was noticed that the copy_to/from_user() friends that was used to
> access virtqueue metdata tends to be very expensive for dataplane
> implementation like vhost since it involves lots of software checks,
> speculation barriers, hardware feature toggling (e.g SMAP). The
> extra cost will be more obvious when transferring small packets since
> the time spent on metadata accessing become more significant.
> ...
>
> Note that it tries to use a direct map instead of a VMAP, as Andrea
> suggested. But it led to several fallouts which were tricky to
> fix[1] (like the use of MMU notifiers to do synchronization). So it
> was finally reverted.
>
> I'm not saying it's a dead end. But we need to find a way to solve the
> issues or use something different. I'm happy to offer help.
>
> 1) Avoid using SMAP for the vhost kthread, for example using a sched
> notifier; I'm not sure whether this is possible or not
> 2) A new iov iterator that doesn't do SMAP at all, this looks
> dangerous and Al might not like it
> 3) (Re)using HMM
> ...
>
> You may want to see archives for more information. We've had a lot of
> discussions.
>
> Thanks
>
> [1] https://lore.kernel.org/lkml/20190731084655.7024-1-jasow...@redhat.com/
>

Thanks a lot for the help. We will take a deeper look and reach out if
we have any questions or once we've finalized the patch.

Thanks,
Liang

> >
> >
> > Thanks,
> > Liang
> >
>

Re: [PATCH v3] vsock/virtio: initialize the_virtio_vsock before using VQs

2023-10-24 Thread Stefano Garzarella

On Mon, Oct 23, 2023 at 10:22:07PM +0300, Alexandru Matei wrote:

Once VQs are filled with empty buffers and we kick the host, it can send
connection requests. If the_virtio_vsock is not initialized by then,
replies are silently dropped and do not reach the host.

virtio_transport_send_pkt() can queue packets once the_virtio_vsock is
set, but they won't be processed until vsock->tx_run is set to true. We
queue vsock->send_pkt_work when initialization finishes to send those
packets queued earlier.

Fixes: 0deab087b16a ("vsock/virtio: use RCU to avoid use-after-free on 
the_virtio_vsock")
Signed-off-by: Alexandru Matei 
---
v3:
- renamed vqs_fill to vqs_start and moved tx_run initialization to it
- queued send_pkt_work at the end of initialization to send packets queued 
earlier
v2:
- split virtio_vsock_vqs_init in vqs_init and vqs_fill and moved
 the_virtio_vsock initialization after vqs_init

net/vmw_vsock/virtio_transport.c | 13 +++--
1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c
index e95df847176b..c0333f9a8002 100644
--- a/net/vmw_vsock/virtio_transport.c
+++ b/net/vmw_vsock/virtio_transport.c
@@ -555,6 +555,11 @@ static int virtio_vsock_vqs_init(struct virtio_vsock *vsock)

virtio_device_ready(vdev);

+   return 0;
+}
+
+static void virtio_vsock_vqs_start(struct virtio_vsock *vsock)
+{
mutex_lock(&vsock->tx_lock);
vsock->tx_run = true;
mutex_unlock(&vsock->tx_lock);
@@ -568,8 +573,6 @@ static int virtio_vsock_vqs_init(struct virtio_vsock *vsock)
virtio_vsock_event_fill(vsock);
vsock->event_run = true;
mutex_unlock(&vsock->event_lock);
-
-   return 0;
}

static void virtio_vsock_vqs_del(struct virtio_vsock *vsock)
@@ -664,6 +667,9 @@ static int virtio_vsock_probe(struct virtio_device *vdev)
goto out;

rcu_assign_pointer(the_virtio_vsock, vsock);
+   virtio_vsock_vqs_start(vsock);
+
+   queue_work(virtio_vsock_workqueue, &vsock->send_pkt_work);


I would move this call into virtio_vsock_vqs_start(), also adding a comment
on top bringing back what you wrote in the commit message. Something like this:


/* virtio_transport_send_pkt() can queue packets once
 * the_virtio_vsock is set, but they won't be processed until
 * vsock->tx_run is set to true. We queue vsock->send_pkt_work
 * when initialization finishes to send those packets queued
 * earlier.
 */

Just as a consideration, we don't need to queue the other workers (rx, 
event) because as long as we don't fill the queues with empty buffers, 
the host can't send us any notification. (We could add it in the comment 
if you want).
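
Putting the two suggestions together, virtio_vsock_vqs_start() from the
diff above would end up roughly like this (just a sketch; the rx part
elided between the two hunks is summarized in a comment):

static void virtio_vsock_vqs_start(struct virtio_vsock *vsock)
{
        mutex_lock(&vsock->tx_lock);
        vsock->tx_run = true;
        mutex_unlock(&vsock->tx_lock);

        /* ... rx queue filled and rx_run set here, as in the code elided
         * between the two hunks above ... */

        mutex_lock(&vsock->event_lock);
        virtio_vsock_event_fill(vsock);
        vsock->event_run = true;
        mutex_unlock(&vsock->event_lock);

        /* virtio_transport_send_pkt() can queue packets once
         * the_virtio_vsock is set, but they won't be processed until
         * vsock->tx_run is set to true. We queue vsock->send_pkt_work
         * when initialization finishes to send those packets queued
         * earlier.
         */
        queue_work(virtio_vsock_workqueue, &vsock->send_pkt_work);
}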


The rest LGTM!

Thanks,
Stefano



mutex_unlock(&the_virtio_vsock_mutex);

@@ -736,6 +742,9 @@ static int virtio_vsock_restore(struct virtio_device *vdev)
goto out;

rcu_assign_pointer(the_virtio_vsock, vsock);
+   virtio_vsock_vqs_start(vsock);
+
+   queue_work(virtio_vsock_workqueue, &vsock->send_pkt_work);

out:
mutex_unlock(&the_virtio_vsock_mutex);
--
2.25.1


