On Thu, Dec 18, 2025 at 08:51:41PM +0100, Harald Mommer wrote:
> Hello,
>
> On 12/14/25 16:25, Francesco Valla wrote:
>
> >>>> +/* Compare with m_can.c/m_can_echo_tx_event() */
>
> For the question whether some comments were originally more personal notes:
> Yes!
>
> This applies especially to the ones which state which already accepted
> driver(s) were consulted to get an idea how things are expected to be done.
> Most of those comments have served their purpose by now.
>
> >>>> +static int virtio_can_read_tx_queue(struct virtqueue *vq)
> >>>> +{
> >>>> +	struct virtio_can_priv *can_priv = vq->vdev->priv;
> >>>> +	struct net_device *dev = can_priv->dev;
> >>>> +	struct virtio_can_tx *can_tx_msg;
> >>>> +	struct net_device_stats *stats;
> >>>> +	unsigned long flags;
> >>>> +	unsigned int len;
> >>>> +	u8 result;
> >>>> +
> >>>> +	stats = &dev->stats;
> >>>> +
> >>>> +	/* Protect list and virtio queue operations */
> >>>> +	spin_lock_irqsave(&can_priv->tx_lock, flags);
> >>>
> >>> The section below seems a pretty big one to protect behind a spin lock.
> >>>
> >>
> >> How can I split it?
> >>
> >
> > The question here is: what needs to be protected? As far as I can tell, the
> > only entity needing some kind of locking here is the queue, while both the
> > ida_* and tx_inflight operations are already covered (the former by
> > design [1], the latter because it's implemented using an atomic).
> >
> > If I'm not wrong (but I might be, so please double check) this can be
> > limited to:
> >
> > /* Protect queue operations */
> > scoped_guard(spinlock_irqsave, &priv->tx_lock)
> > 	err = virtqueue_add_sgs(vq, sgs, 1u, 1u, can_tx_msg,
> > 				GFP_ATOMIC);
> >
> >
> > Maybe the whole locking pattern is a leftover from a previous version,
> > where a list of TX messages was kept?
>
> 1.) There is virtqueue_get_buf() => virtqueue_get_buf_ctx() and there is a
> comment
> " * Caller must ensure we don't call this with other virtqueue
> * operations at the same time (except where noted)."
>
> Are we safe when a queue operation is done in parallel in
> virtio_can_start_xmit()?
> Locking may or may not be necessary here. I cannot tell at the moment.
You need a way to make sure add buf and get buf do not run in parallel;
virtio does not provide this protection itself, it is up to the caller.
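
For illustration, a minimal sketch of what "up to the caller" means here
(untested; the vq_tx field name is an assumption, not necessarily what the
patch uses): the xmit path takes the same lock the TX completion path takes
around virtqueue_get_buf(), so virtqueue_add_sgs() and virtqueue_get_buf()
on that queue can never run concurrently.

/* xmit side, e.g. called from virtio_can_start_xmit() */
static int virtio_can_tx_enqueue(struct virtio_can_priv *priv,
				 struct scatterlist **sgs,
				 struct virtio_can_tx *can_tx_msg)
{
	unsigned long flags;
	int err;

	/* Same lock as the TX completion path */
	spin_lock_irqsave(&priv->tx_lock, flags);
	err = virtqueue_add_sgs(priv->vq_tx, sgs, 1u, 1u, can_tx_msg,
				GFP_ATOMIC);
	spin_unlock_irqrestore(&priv->tx_lock, flags);

	return err;
}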
>
> 2.) There was once a "list_del(&can_tx_msg->list);" in the code here.
>
> If a list_add_tail() or a list_del() had been executed at the same time in
> virtio_can_start_xmit(), we would have had a garbled linked list.
>
> That linked list does not exist any more in the newer code base.
>
> => it could be that the lock is not needed any more at all
> => it could be that only the queue operations have to be protected now, which
> would shorten the locking time and simplify the code (see the sketch below)
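
If it really is only the queue operations, the completion path could be
narrowed to something like this (untested sketch based on the code quoted
below; the error accounting and the return value at the end are placeholders,
not the patch's actual code):

static int virtio_can_read_tx_queue(struct virtqueue *vq)
{
	struct virtio_can_priv *can_priv = vq->vdev->priv;
	struct net_device *dev = can_priv->dev;
	struct virtio_can_tx *can_tx_msg;
	unsigned long flags;
	unsigned int len;
	u8 result;

	/* Serialize only the virtqueue access itself against
	 * virtqueue_add_sgs() in virtio_can_start_xmit(). */
	spin_lock_irqsave(&can_priv->tx_lock, flags);
	can_tx_msg = virtqueue_get_buf(vq, &len);
	spin_unlock_irqrestore(&can_priv->tx_lock, flags);

	if (!can_tx_msg)
		return 0; /* No more data */

	if (unlikely(len < sizeof(struct virtio_can_tx_in))) {
		netdev_err(dev, "TX ACK: Device sent no result code\n");
		result = VIRTIO_CAN_RESULT_NOT_OK; /* Keep things going */
	} else {
		result = can_tx_msg->tx_in.result;
	}

	/* Result handling, stats, ida_free() and tx_inflight accounting
	 * can then run without the lock held. */
	if (result != VIRTIO_CAN_RESULT_OK)
		dev->stats.tx_errors++;

	return 1;
}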
>
> >>>> +
> >>>> +	can_tx_msg = virtqueue_get_buf(vq, &len);
> >>>> +	if (!can_tx_msg) {
> >>>> +		spin_unlock_irqrestore(&can_priv->tx_lock, flags);
> >>>> +		return 0; /* No more data */
> >>>> +	}
> >>>> +
> >>>> +	if (unlikely(len < sizeof(struct virtio_can_tx_in))) {
> >>>> +		netdev_err(dev, "TX ACK: Device sent no result code\n");
> >>>> +		result = VIRTIO_CAN_RESULT_NOT_OK; /* Keep things going */
> >>>> +	} else {
> >>>> +		result = can_tx_msg->tx_in.result;
> >>>> +	}
> >>>> +
> >
> > (snip)
> >
>
> >>>> +	if (!priv->rpkt) {
> >>>> +		virtio_can_del_vq(vdev);
> >>>> +		goto on_failure;
> >>>> +	}
> >>>> +	virtio_can_populate_rx_vq(vdev);
> >>>> +
> >>>> +	err = register_virtio_can_dev(dev);
> >>>> +	if (err) {
> >>>> +		virtio_can_del_vq(vdev);
> >>>> +		goto on_failure;
> >>>> +	}
> >>>> +
> >>>> +	napi_enable(&priv->napi);
> >>>> +	napi_enable(&priv->napi_tx);
> >>>
> >>> Most of the existing drivers enable the napi(s) during the open() phase,
> >>> IIUC to avoid scheduling napi operations for devices that might never
> >>> get used. But here maybe there is a specific reason to do it this way?
> >>>
> >>
> >> I have no idea. I moved it to open() and something stopped working. I
> >> am investigating it.
> >>
> >
> > On second thought, it may be wiser to have the napis enabled on probe,
> > to drop the incoming messages even when the interface is brought down.
>
> It's been a while since then, but I wanted to drop messages so that a 3 hours
> old cooling water temperature is not lurking in some virtio message buffer and
> misinterpreted as a current value. This may have the disadvantage of causing
> load when the driver is not opened. But I see you also thought about 3 hours
> old outdated values now, which may cause trouble.
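
For what it's worth, if napi_enable() were moved to open(), the stale buffers
could also be dropped explicitly there. Rough, untested sketch only; the
vq_rx field name and the re-queue helper are made up, not the patch's actual
code:

static int virtio_can_open(struct net_device *ndev)
{
	struct virtio_can_priv *priv = netdev_priv(ndev);
	struct virtqueue *vq = priv->vq_rx;	/* assumed field name */
	unsigned int len;
	void *buf;

	/* Discard whatever the device queued while the interface was down
	 * so outdated values are never delivered as current ones, then
	 * hand the buffers back to the device. */
	while ((buf = virtqueue_get_buf(vq, &len)))
		virtio_can_queue_rx_buf(priv, buf);	/* assumed helper */

	napi_enable(&priv->napi);
	napi_enable(&priv->napi_tx);
	netif_start_queue(ndev);

	return 0;
}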
>
> >
> > (last snip)
> >
> >
> > While stress testing this, I noticed that flooding the virtio-can
> > interface with packets leads to a hang of the interface itself.
> > I am seeing this when issuing, at the host side:
> >
> > while true; do cansend can0 123#00; done
> >
> > with:
> >
> > - QEMU: the tip of the master branch plus [2]
> > - vhost-device: the tip of the main branch
> >
> > and the following QEMU invocation:
> >
> > qemu-system-x86_64 -serial mon:stdio \
> > -m 2G -smp 2 \
> > -kernel $(pwd)/BUILD.bin/arch/x86/boot/bzImage \
> > -initrd /home/francesco/SRC/LINUX_KERNEL/initramfs.gz \
> > -append "loglevel=7 console=ttyS0" \
> > -machine memory-backend=pc.ram \
> > -object memory-backend-file,id=pc.ram,size=2G,mem-path=/tmp/pc.ram,share=on \
> > -chardev socket,id=can0,path=/tmp/sock-can0 \
> > -device vhost-user-can-pci,chardev=can0
>
> I had this problem when I enabled the experimental late TX ACK feature on the
> device side, instead of sending the TX ACK early even when the CAN message had
> not yet been transmitted on the (physical) bus. In this case I relied on no
> ACK message (own sent message received) being lost; otherwise I ran out of
> messages in the transmit queue, with everything waiting until doomsday for
> ACKs which would never come.
>
> The problem was that somewhere in the Linux stack those acknowledgements got
> lost under heavy load on the device side. The workaround was to ACK the TX
> message early in the virtio device (meaning the message is put back into the
> used queue immediately when received). But this is a device thing: the device
> MUST put ALL messages back into the used queue, not forgetting about some
> under whatever circumstances, otherwise the avail queue will become empty
> forever.
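
To make the failure mode concrete on the driver side, a hypothetical sketch
(the tx_inflight/tx_limit names are assumptions, not necessarily the patch's
exact code): with the usual netif flow control, a single TX buffer the device
never returns keeps the queue stopped forever.

/* start_xmit side: stop accepting skbs once the TX budget is used up */
static void virtio_can_tx_account(struct virtio_can_priv *priv,
				  struct net_device *dev)
{
	if (atomic_inc_return(&priv->tx_inflight) >= priv->tx_limit)
		netif_stop_queue(dev);
}

/* TX completion side: only a buffer returned by the device wakes the queue
 * again; a buffer the device "forgets" means this never runs for it and the
 * interface stays stuck, even across ip link down/up. */
static void virtio_can_tx_complete(struct virtio_can_priv *priv,
				   struct net_device *dev)
{
	if (atomic_dec_return(&priv->tx_inflight) < priv->tx_limit)
		netif_wake_queue(dev);
}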
>
> Besides that, I could do whatever I wanted stressing the code and it did not
> stop. But that code was different from what I see now, and the testing
> environment was also a different one.
>
> > Restarting the interface (i.e. ip link set down and then up) does not
> > fix the situation.
> >
> > I'll try to do some more testing during the next days.
> Other than fixing the swapped feature flag values for the next release
> internally, I have not yet had the chance to look deeply into all those
> changes and really think about them in depth.
>