Re: [RFC v3 3/3] vhost: Allocate memory for packed vring

2024-09-12 Thread Eugenio Perez Martin
On Wed, Sep 11, 2024 at 9:36 PM Sahil  wrote:
>
> Hi,
>
> On Monday, September 9, 2024 6:04:45 PM GMT+5:30 Eugenio Perez Martin wrote:
> > On Sun, Sep 8, 2024 at 9:47 PM Sahil  wrote:
> > > On Friday, August 30, 2024 4:18:31 PM GMT+5:30 Eugenio Perez Martin wrote:
> > > > On Fri, Aug 30, 2024 at 12:20 PM Sahil  wrote:
> > > > [...]
> > > > vdpa_sim does not support packed vq at the moment. You need to build
> > > > the use case #3 of the second part of that blog [1]. It's good that
> > > > you build the vdpa_sim earlier as it is a simpler setup.
> > > >
> > > > If you have problems with the vp_vdpa environment please let me know
> > > > so we can find alternative setups.
> > >
> > > Thank you for the clarification. I tried setting up the vp_vdpa
> > > environment (scenario 3) but I ended up running into a problem
> > > in the L1 VM.
> > >
> > > I verified that nesting is enabled in KVM (L0):
> > >
> > > $ grep -oE "(vmx|svm)" /proc/cpuinfo | sort | uniq
> > > vmx
> > >
> > > $ cat /sys/module/kvm_intel/parameters/nested
> > > Y
> > >
> > > There are no issues when booting L1. I start the VM by running:
> > >
> > > $ sudo ./qemu/build/qemu-system-x86_64 \
> > > -enable-kvm \
> > > -drive file=//home/ig91/fedora_qemu_test_vm/L1.qcow2,media=disk,if=virtio \
> > > -net nic,model=virtio \
> > > -net user,hostfwd=tcp::-:22 \
> > > -device intel-iommu,snoop-control=on \
> > > -device virtio-net-pci,netdev=net0,disable-legacy=on,disable-modern=off,iommu_platform=on,event_idx=off,packed=on,bus=pcie.0,addr=0x4 \
> > > -netdev tap,id=net0,script=no,downscript=no \
> > > -nographic \
> > > -m 2G \
> > > -smp 2 \
> > > -M q35 \
> > > -cpu host \
> > > 2>&1 | tee vm.log
> > >
> > > Kernel version in L1:
> > >
> > > # uname -a
> > > Linux fedora 6.8.5-201.fc39.x86_64 #1 SMP PREEMPT_DYNAMIC Thu Apr 11
> > > 18:25:26 UTC 2024 x86_64 GNU/Linux
> > Did you run the kernels with the arguments "iommu=pt intel_iommu=on"?
> > You can print them with cat /proc/cmdline.
>
> I missed this while setting up the environment. After setting the kernel
> params I managed to move past this issue but my environment in virtualbox
> was very unstable and it kept crashing.
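[Editor's note: a quick way to double-check the boot parameters in each guest is to compare /proc/cmdline against what vhost-vdpa needs. A minimal sketch; the sample string below is made up:]

```python
def check_cmdline(cmdline, required=("iommu=pt", "intel_iommu=on")):
    """Return the set of required boot parameters missing from a kernel
    command line string (as read from /proc/cmdline)."""
    return set(required) - set(cmdline.split())

# In a guest you would feed it the real command line:
#   missing = check_cmdline(open("/proc/cmdline").read())
sample = "BOOT_IMAGE=/vmlinuz-6.8.5 root=/dev/vda1 iommu=pt intel_iommu=on"
print(check_cmdline(sample))  # -> set(), i.e. nothing missing
```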
>

I've no experience with virtualbox+vdpa, sorry :). Why not use QEMU also for L1?

> I managed to get L1 to run on my host OS, so scenario 3 is now up and
> running. However, the packed bit seems to be disabled in this scenario too.
>
> L0 (host machine) specs:
> - kernel version:
>   6.6.46-1-lts
>
> - QEMU version:
>   9.0.50 (v8.2.0-5536-g16514611dc)
>
> - vDPA version:
>   iproute2-6.10.0
>
> L1 specs:
>
> - kernel version:
>   6.8.5-201.fc39.x86_64
>
> - QEMU version:
>   9.0.91
>
> - vDPA version:
>   iproute2-6.10.0
>
> L2 specs:
> - kernel version
>   6.8.7-200.fc39.x86_64
>
> I followed the following steps to set up scenario 3:
>
>  In L0 
>
> $ grep -oE "(vmx|svm)" /proc/cpuinfo | sort | uniq
> vmx
>
> $ cat /sys/module/kvm_intel/parameters/nested
> Y
>
> $ sudo ./qemu/build/qemu-system-x86_64 \
> -enable-kvm \
> -drive file=//home/valdaarhun/valdaarhun/qcow2_img/L1.qcow2,media=disk,if=virtio \
> -net nic,model=virtio \
> -net user,hostfwd=tcp::-:22 \
> -device intel-iommu,snoop-control=on \
> -device virtio-net-pci,netdev=net0,disable-legacy=on,disable-modern=off,iommu_platform=on,event_idx=off,packed=on,bus=pcie.0,addr=0x4 \
> -netdev tap,id=net0,script=no,downscript=no \
> -nographic \
> -m 8G \
> -smp 4 \
> -M q35 \
> -cpu host \
> 2>&1 | tee vm.log
>
>  In L1 
>
> I verified that the following config variables are set as described in the 
> blog [1].
>
> CONFIG_VIRTIO_VDPA=m
> CONFIG_VDPA=m
> CONFIG_VP_VDPA=m
> CONFIG_VHOST_VDPA=m
>
> # modprobe vdpa
> # modprobe vhost_vdpa
> # modprobe vp_vdpa
>
> # lsmod | grep -i vdpa
> vp_vdpa 20480  0
> vhost_vdpa  32768  0
> vhost   65536  1 vhost_vdpa
> vhost_iotlb 16384  2 vhost_vdpa,vhost
> vdpa36864  2 vp_vdpa,vhost_vdpa
> irqbypass   12288  2 vhost_vdpa,kvm
>
> # lspci | grep -i ethernet
> 00:04.0 Ethernet controller: Red Hat, Inc. Virtio 1.0 network device (rev 01)
>
> # lspci -nn |

Re: [PATCH 0/2] Move net backend cleanup to NIC cleanup

2024-09-11 Thread Eugenio Perez Martin
On Wed, Sep 11, 2024 at 11:04 AM Eugenio Perez Martin
 wrote:
>
> On Tue, Sep 10, 2024 at 5:46 AM Jason Wang  wrote:
> >
> > On Tue, Sep 10, 2024 at 11:41 AM Si-Wei Liu  wrote:
> > >
> > > Hi Jason,
> > >
> > > It seems this series wasn't applied successfully, I still cannot see it
> > > from the latest tree. Any idea?
> >
> > It breaks make check.
> >
> > Eugenio, would you want to fix and resend the series?
> >
>
> I'm trying to reproduce but with no luck :(.
>

I'm able to reproduce consistently now.

> For the record this is the failed log. Is it possible to try to
> reproduce it again in the machine / env it crashed?
>
> ▶  10/354 ERROR:../tests/qtest/qos-test.c:191:subprocess_run_one_test:
> child process 
> (/x86_64/pc/i440FX-pcihost/pci-bus-pc/pci-bus/virtio-net-pci/virtio-net/virtio-net-tests/vhost-user/migrate/subprocess
> [1494462]) failed unexpectedly ERROR
>  10/354 qemu:qtest+qtest-x86_64 / qtest-x86_64/qos-test
> ERROR   14.19s   killed by signal 6 SIGABRT
> >>> PYTHON=/home/devel/git/qemu/build/pyvenv/bin/python3 
> >>> G_TEST_DBUS_DAEMON=/home/devel/git/qemu/tests/dbus-vmstate-daemon.sh 
> >>> QTEST_QEMU_STORAGE_DAEMON_BINARY=./storage-daemon/qemu-storage-daemon 
> >>> QTEST_QEMU_IMG=./qemu-img QTEST_QEMU_BINARY=./qemu-system-x86_64 
> >>> MALLOC_PERTURB_=82 /home/devel/git/qemu/build/tests/qtest/qos-test --tap 
> >>> -k
> ―――
> ✀  
> ―――
> stderr:
> Vhost user backend fails to broadcast fake RARP
> ../tests/qtest/libqtest.c:204: kill_qemu() detected QEMU death from
> signal 11 (Segmentation fault) (core dumped)
> ../tests/qtest/libqtest.c:204: kill_qemu() detected QEMU death from
> signal 11 (Segmentation fault) (core dumped)
> **
> ERROR:../tests/qtest/qos-test.c:191:subprocess_run_one_test: child
> process 
> (/x86_64/pc/i440FX-pcihost/pci-bus-pc/pci-bus/virtio-net-pci/virtio-net/virtio-net-tests/vhost-user/migrate/subprocess
> [1494462]) failed unexpectedly
> > Thanks
> >
> > >
> > > In any case the fix LGTM.
> > >
> > > Reviewed-by: Si-Wei Liu 
> > >
> > > Thanks,
> > > -Siwei
> > >
> > > On 1/31/2024 9:43 PM, Jason Wang wrote:
> > > > On Mon, Jan 29, 2024 at 9:24 PM Eugenio Pérez  
> > > > wrote:
> > > >> Commit a0d7215e33 ("vhost-vdpa: do not cleanup the vdpa/vhost-net
> > > >> structures if peer nic is present") effectively delayed the backend
> > > >> cleanup, allowing the frontend or the guest to access its resources as
> > > >> long as the frontend NIC is still visible to the guest.
> > > >>
> > > >> However, it does not clean up the resources until the QEMU process is
> > > >> over.  This causes an effective leak if the device is deleted with
> > > >> device_del, as there is no way to close the vdpa device.  This makes
> > > >> it impossible to re-add that device to this or other QEMU instances
> > > >> until the first instance of QEMU is finished.
> > > >>
> > > >> Move the cleanup from qemu_cleanup to the NIC deletion.
> > > >>
> > > >> Fixes: a0d7215e33 ("vhost-vdpa: do not cleanup the vdpa/vhost-net 
> > > >> structures if peer nic is present")
> > > >> Acked-by: Jason Wang 
> > > >> Reported-by: Lei Yang 
> > > >> Signed-off-by: Eugenio Pérez 
> > > >>
> > > >> Eugenio Pérez (2):
> > > >>net: parameterize the removing client from nc list
> > > >>net: move backend cleanup to NIC cleanup
> > > >>
> > > >>   net/net.c| 30 --
> > > >>   net/vhost-vdpa.c |  8 
> > > >>   2 files changed, 20 insertions(+), 18 deletions(-)
> > > >>
> > > >> --
> > > > Queued.
> > > >
> > > > Thanks
> > > >
> > >
> >




Re: [RFC 1/2] vhost-vdpa: Decouple the IOVA allocator

2024-09-11 Thread Eugenio Perez Martin
On Wed, Sep 11, 2024 at 11:06 AM Si-Wei Liu  wrote:
>
>
>
> On 9/9/2024 11:22 PM, Eugenio Perez Martin wrote:
> > On Tue, Sep 10, 2024 at 7:30 AM Si-Wei Liu  wrote:
> >> Sorry for the delayed response, it seems I missed the email reply for
> >> some reason during the long weekend.
> >>
> >> On 9/2/2024 4:09 AM, Eugenio Perez Martin wrote:
> >>> On Fri, Aug 30, 2024 at 11:05 PM Si-Wei Liu  wrote:
> >>>>
> >>>> On 8/30/2024 1:05 AM, Eugenio Perez Martin wrote:
> >>>>> On Fri, Aug 30, 2024 at 6:20 AM Si-Wei Liu  
> >>>>> wrote:
> >>>>>> On 8/29/2024 9:53 AM, Eugenio Perez Martin wrote:
> >>>>>>> On Wed, Aug 21, 2024 at 2:56 PM Jonah Palmer 
> >>>>>>>  wrote:
> >>>>>>>> Decouples the IOVA allocator from the IOVA->HVA tree and instead adds
> >>>>>>>> the allocated IOVA range to an IOVA-only tree (iova_map). This IOVA 
> >>>>>>>> tree
> >>>>>>>> will hold all IOVA ranges that have been allocated (e.g. in the
> >>>>>>>> IOVA->HVA tree) and are removed when any IOVA ranges are deallocated.
> >>>>>>>>
> >>>>>>> A new API function vhost_iova_tree_insert() is also created to add an
> >>>>>>>> IOVA->HVA mapping into the IOVA->HVA tree.
> >>>>>>>>
> >>>>>>> I think this is a good first iteration but we can take steps to
> >>>>>>> simplify it. Also, it is great to be able to make points on real code
> >>>>>>> instead of designs on the air :).
> >>>>>>>
> >>>>>>> I expected a split of vhost_iova_tree_map_alloc between the current
> >>>>>>> vhost_iova_tree_map_alloc and vhost_iova_tree_map_alloc_gpa, or
> >>>>>>> similar. Similarly, a vhost_iova_tree_remove and
> >>>>>>> vhost_iova_tree_remove_gpa would be needed.
> >>>>>>>
> >>>>>>> The first one is used for regions that don't exist in the guest, like
> >>>>>>> SVQ vrings or CVQ buffers. The second one is the one used by the
> >>>>>>> memory listener to map the guest regions into the vdpa device.
> >>>>>>>
> >>>>>>> Implementation wise, only two trees are actually needed:
> >>>>>>> * Current iova_taddr_map that contains all IOVA->vaddr translations as
> >>>>>>> seen by the device, so both allocation functions can work on a single
> >>>>>>> tree. The function iova_tree_find_iova keeps using this one, so the
> >>>>>> I thought we had thorough discussion about this and agreed upon the
> >>>>>> decoupled IOVA allocator solution.
> >>>>> My interpretation of it is to leave the allocator as the current one,
> >>>>> and create a new tree with GPA which is guaranteed to be unique. But
> >>>>> we can talk over it of course.
> >>>>>
> >>>>>> But maybe I missed something earlier,
> >>>>>> I am not clear how come this iova_tree_find_iova function could still
> >>>>>> work with the full IOVA-> HVA tree when it comes to aliased memory or
> >>>>>> overlapped HVAs? Granted, for the memory map removal in the
> >>>>>> .region_del() path, we could rely on the GPA tree to locate the
> >>>>>> corresponding IOVA, but how come the translation path could figure out
> >>>>>> which IOVA range to return when the vaddr happens to fall in an
> >>>>>> overlapped HVA range?
> >>>>> That is not a problem, as they both translate to the same address at 
> >>>>> the device.
> >>>> Not sure I followed, it might return a wrong IOVA (range) which the host
> >>>> kernel may have conflict or unmatched attribute i.e. permission, size et
> >>>> al in the map.
> >>>>
> >>> Let's leave out the permissions at the moment. I'm going to use the
> >>> example you found, but I'll reorder (1) and (3) insertions so it picks
> >>> the "wrong" IOVA range intentionally:
> >>>
> >>> (1)
> >>> HVA: [0x7f7903ea, 0x7f7903ec)
> >>> GPA: [0xfeda, 0xfedc)
> >>> IO

Re: [PATCH 0/2] Move net backend cleanup to NIC cleanup

2024-09-11 Thread Eugenio Perez Martin
On Tue, Sep 10, 2024 at 5:46 AM Jason Wang  wrote:
>
> On Tue, Sep 10, 2024 at 11:41 AM Si-Wei Liu  wrote:
> >
> > Hi Jason,
> >
> > It seems this series wasn't applied successfully, I still cannot see it
> > from the latest tree. Any idea?
>
> It breaks make check.
>
> Eugenio, would you want to fix and resend the series?
>

I'm trying to reproduce but with no luck :(.

For the record this is the failed log. Is it possible to try to
reproduce it again in the machine / env it crashed?

▶  10/354 ERROR:../tests/qtest/qos-test.c:191:subprocess_run_one_test:
child process 
(/x86_64/pc/i440FX-pcihost/pci-bus-pc/pci-bus/virtio-net-pci/virtio-net/virtio-net-tests/vhost-user/migrate/subprocess
[1494462]) failed unexpectedly ERROR
 10/354 qemu:qtest+qtest-x86_64 / qtest-x86_64/qos-test
ERROR   14.19s   killed by signal 6 SIGABRT
>>> PYTHON=/home/devel/git/qemu/build/pyvenv/bin/python3 
>>> G_TEST_DBUS_DAEMON=/home/devel/git/qemu/tests/dbus-vmstate-daemon.sh 
>>> QTEST_QEMU_STORAGE_DAEMON_BINARY=./storage-daemon/qemu-storage-daemon 
>>> QTEST_QEMU_IMG=./qemu-img QTEST_QEMU_BINARY=./qemu-system-x86_64 
>>> MALLOC_PERTURB_=82 /home/devel/git/qemu/build/tests/qtest/qos-test --tap -k
―――
✀  
―――
stderr:
Vhost user backend fails to broadcast fake RARP
../tests/qtest/libqtest.c:204: kill_qemu() detected QEMU death from
signal 11 (Segmentation fault) (core dumped)
../tests/qtest/libqtest.c:204: kill_qemu() detected QEMU death from
signal 11 (Segmentation fault) (core dumped)
**
ERROR:../tests/qtest/qos-test.c:191:subprocess_run_one_test: child
process 
(/x86_64/pc/i440FX-pcihost/pci-bus-pc/pci-bus/virtio-net-pci/virtio-net/virtio-net-tests/vhost-user/migrate/subprocess
[1494462]) failed unexpectedly
> Thanks
>
> >
> > In any case the fix LGTM.
> >
> > Reviewed-by: Si-Wei Liu 
> >
> > Thanks,
> > -Siwei
> >
> > On 1/31/2024 9:43 PM, Jason Wang wrote:
> > > On Mon, Jan 29, 2024 at 9:24 PM Eugenio Pérez  wrote:
> > >> Commit a0d7215e33 ("vhost-vdpa: do not cleanup the vdpa/vhost-net
> > >> structures if peer nic is present") effectively delayed the backend
> > >> cleanup, allowing the frontend or the guest to access its resources as
> > >> long as the frontend NIC is still visible to the guest.
> > >>
> > >> However, it does not clean up the resources until the QEMU process is
> > >> over.  This causes an effective leak if the device is deleted with
> > >> device_del, as there is no way to close the vdpa device.  This makes
> > >> it impossible to re-add that device to this or other QEMU instances
> > >> until the first instance of QEMU is finished.
> > >>
> > >> Move the cleanup from qemu_cleanup to the NIC deletion.
> > >>
> > >> Fixes: a0d7215e33 ("vhost-vdpa: do not cleanup the vdpa/vhost-net 
> > >> structures if peer nic is present")
> > >> Acked-by: Jason Wang 
> > >> Reported-by: Lei Yang 
> > >> Signed-off-by: Eugenio Pérez 
> > >>
> > >> Eugenio Pérez (2):
> > >>net: parameterize the removing client from nc list
> > >>net: move backend cleanup to NIC cleanup
> > >>
> > >>   net/net.c| 30 --
> > >>   net/vhost-vdpa.c |  8 
> > >>   2 files changed, 20 insertions(+), 18 deletions(-)
> > >>
> > >> --
> > > Queued.
> > >
> > > Thanks
> > >
> >
>




Re: [RFC 1/2] vhost-vdpa: Decouple the IOVA allocator

2024-09-09 Thread Eugenio Perez Martin
On Tue, Sep 10, 2024 at 7:30 AM Si-Wei Liu  wrote:
>
> Sorry for the delayed response, it seems I missed the email reply for
> some reason during the long weekend.
>
> On 9/2/2024 4:09 AM, Eugenio Perez Martin wrote:
> > On Fri, Aug 30, 2024 at 11:05 PM Si-Wei Liu  wrote:
> >>
> >>
> >> On 8/30/2024 1:05 AM, Eugenio Perez Martin wrote:
> >>> On Fri, Aug 30, 2024 at 6:20 AM Si-Wei Liu  wrote:
> >>>>
> >>>> On 8/29/2024 9:53 AM, Eugenio Perez Martin wrote:
> >>>>> On Wed, Aug 21, 2024 at 2:56 PM Jonah Palmer  
> >>>>> wrote:
> >>>>>> Decouples the IOVA allocator from the IOVA->HVA tree and instead adds
> >>>>>> the allocated IOVA range to an IOVA-only tree (iova_map). This IOVA 
> >>>>>> tree
> >>>>>> will hold all IOVA ranges that have been allocated (e.g. in the
> >>>>>> IOVA->HVA tree) and are removed when any IOVA ranges are deallocated.
> >>>>>>
> >>>>>> A new API function vhost_iova_tree_insert() is also created to add an
> >>>>>> IOVA->HVA mapping into the IOVA->HVA tree.
> >>>>>>
> >>>>> I think this is a good first iteration but we can take steps to
> >>>>> simplify it. Also, it is great to be able to make points on real code
> >>>>> instead of designs on the air :).
> >>>>>
> >>>>> I expected a split of vhost_iova_tree_map_alloc between the current
> >>>>> vhost_iova_tree_map_alloc and vhost_iova_tree_map_alloc_gpa, or
> >>>>> similar. Similarly, a vhost_iova_tree_remove and
> >>>>> vhost_iova_tree_remove_gpa would be needed.
> >>>>>
> >>>>> The first one is used for regions that don't exist in the guest, like
> >>>>> SVQ vrings or CVQ buffers. The second one is the one used by the
> >>>>> memory listener to map the guest regions into the vdpa device.
> >>>>>
> >>>>> Implementation wise, only two trees are actually needed:
> >>>>> * Current iova_taddr_map that contains all IOVA->vaddr translations as
> >>>>> seen by the device, so both allocation functions can work on a single
> >>>>> tree. The function iova_tree_find_iova keeps using this one, so the
> >>>> I thought we had thorough discussion about this and agreed upon the
> >>>> decoupled IOVA allocator solution.
> >>> My interpretation of it is to leave the allocator as the current one,
> >>> and create a new tree with GPA which is guaranteed to be unique. But
> >>> we can talk over it of course.
> >>>
> >>>> But maybe I missed something earlier,
> >>>> I am not clear how come this iova_tree_find_iova function could still
> >>>> work with the full IOVA-> HVA tree when it comes to aliased memory or
> >>>> overlapped HVAs? Granted, for the memory map removal in the
> >>>> .region_del() path, we could rely on the GPA tree to locate the
> >>>> corresponding IOVA, but how come the translation path could figure out
> >>>> which IOVA range to return when the vaddr happens to fall in an
> >>>> overlapped HVA range?
> >>> That is not a problem, as they both translate to the same address at the 
> >>> device.
> >> Not sure I followed, it might return a wrong IOVA (range) which the host
> >> kernel may have conflict or unmatched attribute i.e. permission, size et
> >> al in the map.
> >>
> > Let's leave out the permissions at the moment. I'm going to use the
> > example you found, but I'll reorder (1) and (3) insertions so it picks
> > the "wrong" IOVA range intentionally:
> >
> > (1)
> > HVA: [0x7f7903ea, 0x7f7903ec)
> > GPA: [0xfeda, 0xfedc)
> > IOVA: [0x1000, 0x21000)
> >
> > (2)
> > HVA: [0x7f7983e0, 0x7f9903e0)
> > GPA: [0x1, 0x208000)
> > IOVA: [0x80001000, 0x201000)
> >
> > (3)
> > HVA: [0x7f7903e0, 0x7f7983e0)
> > GPA: [0x0, 0x8000)
> > IOVA: [0x201000, 0x208000)
> >
> > Let's say that SVQ wants to translate the HVA range
> > 0xfeda-0xfedd. So it makes available for the device two
> > chained buffers: One with addr=0x1000 len=0x2 and the other one
> > with addr=(0x2c1000 len=0x1).
> >
> > The VirtIO device 

Re: [RFC v3 3/3] vhost: Allocate memory for packed vring

2024-09-09 Thread Eugenio Perez Martin
On Sun, Sep 8, 2024 at 9:47 PM Sahil  wrote:
>
> Hi,
>
> On Friday, August 30, 2024 4:18:31 PM GMT+5:30 Eugenio Perez Martin wrote:
> > On Fri, Aug 30, 2024 at 12:20 PM Sahil  wrote:
> > > Hi,
> > >
> > > On Tuesday, August 27, 2024 9:00:36 PM GMT+5:30 Eugenio Perez Martin 
> > > wrote:
> > > > On Wed, Aug 21, 2024 at 2:20 PM Sahil  wrote:
> > > > > [...]
> > > > > I have been trying to test my changes so far as well. I am not very
> > > > > clear
> > > > > on a few things.
> > > > >
> > > > > Q1.
> > > > > I built QEMU from source with my changes and followed the vdpa_sim +
> > > > > vhost_vdpa tutorial [1]. The VM seems to be running fine. How do I
> > > > > check
> > > > > if the packed format is being used instead of the split vq format for
> > > > > shadow virtqueues? I know the packed format is used when virtio_vdev
> > > > > has
> > > > > got the VIRTIO_F_RING_PACKED bit enabled. Is there a way of checking
> > > > > that
> > > > > this is the case?
> > > >
> > > > You can see the features that the driver acked from the guest by
> > > > checking sysfs. Once you know the PCI BFN from lspci:
> > > > # lspci -nn|grep '\[1af4:1041\]'
> > > > 01:00.0 Ethernet controller [0200]: Red Hat, Inc. Virtio 1.0 network
> > > > device [1af4:1041] (rev 01)
> > > > # cut -c 35
> > > > /sys/devices/pci:00/:00:02.0/:01:00.0/virtio0/features 0
> > > >
> > > > Also, you can check from QEMU by simply tracing if your functions are
> > > > being called.
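[Editor's note: as far as I understand, the sysfs "features" file is a string of '0'/'1' characters, one per feature bit with bit 0 first; VIRTIO_F_RING_PACKED is bit 34, which is why the command above uses `cut -c 35` (cut is 1-indexed). A sketch of decoding it, with a made-up bitstring:]

```python
# Decode a virtio sysfs "features" bitstring: character i is feature bit i.
VIRTIO_F_VERSION_1 = 32
VIRTIO_F_RING_PACKED = 34

def feature_acked(bits, feature_bit):
    """bits: contents of /sys/.../virtioN/features; True if the bit is set."""
    bits = bits.strip()
    return feature_bit < len(bits) and bits[feature_bit] == "1"

# Made-up example where bits 32 (VERSION_1) and 34 (RING_PACKED) are set:
sample = "0" * 32 + "101"
print(feature_acked(sample, VIRTIO_F_RING_PACKED))  # -> True
```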
> > > >
> > > > > Q2.
> > > > > What's the recommended way to see what's going on under the hood? I
> > > > > tried
> > > > > using the -D option so QEMU's logs are written to a file but the file
> > > > > was
> > > > > empty. Would using qemu with -monitor stdio or attaching gdb to the
> > > > > QEMU
> > > > > VM be worthwhile?
> > > >
> > > > You need to add --trace options with the regex you want to get to
> > > > enable any output. For example, --trace 'vhost_vdpa_*' print all the
> > > > trace_vhost_vdpa_* functions.
> > > >
> > > > If you want to speed things up, you can just replace the interesting
> > > > trace_... functions with fprintf(stderr, ...). We can add the trace
> > > > ones afterwards.
> > >
> > > Understood. I am able to trace the functions that are being called with
> > > fprintf. I'll stick with fprintf for now.
> > >
> > > I realized that packed vqs are not being used in the test environment. I
> > > see that in "hw/virtio/vhost-shadow-virtqueue.c", svq->is_packed is set
> > > to 0 and that calls vhost_svq_add_split(). I am not sure how one enables
> > > the packed feature bit. I don't know if this is an environment issue.
> > >
> > > I built qemu from the latest source with my changes on top of it. I
> > > followed this article [1] to set up the environment.
> > >
> > > On the host machine:
> > >
> > > $ uname -a
> > > Linux fedora 6.10.5-100.fc39.x86_64 #1 SMP PREEMPT_DYNAMIC Wed Aug 14
> > > 15:49:25 UTC 2024 x86_64 GNU/Linux
> > >
> > > $ ./qemu/build/qemu-system-x86_64 --version
> > > QEMU emulator version 9.0.91
> > >
> > > $ vdpa -V
> > > vdpa utility, iproute2-6.4.0
> > >
> > > All the relevant vdpa modules have been loaded in accordance with [1].
> > >
> > > $ lsmod | grep -iE "(vdpa|virtio)"
> > > vdpa_sim_net12288  0
> > > vdpa_sim24576  1 vdpa_sim_net
> > > vringh  32768  2 vdpa_sim,vdpa_sim_net
> > > vhost_vdpa  32768  2
> > > vhost   65536  1 vhost_vdpa
> > > vhost_iotlb 16384  4 vdpa_sim,vringh,vhost_vdpa,vhost
> > > vdpa36864  3 vdpa_sim,vhost_vdpa,vdpa_sim_net
> > >
> > > $ ls -l /sys/bus/vdpa/devices/vdpa0/driver
> > > lrwxrwxrwx. 1 root root 0 Aug 30 11:25 /sys/bus/vdpa/devices/vdpa0/driver
> > > -> ../../bus/vdpa/drivers/vhost_vdpa
> > >
> > > In the output of the following command, I see ANY_LAYOUT is supported.
> > > According to virtio_config.h [2] in the linux kernel, this r

Re: [RFC 1/2] vhost-vdpa: Decouple the IOVA allocator

2024-09-02 Thread Eugenio Perez Martin
On Fri, Aug 30, 2024 at 11:05 PM Si-Wei Liu  wrote:
>
>
>
> On 8/30/2024 1:05 AM, Eugenio Perez Martin wrote:
> > On Fri, Aug 30, 2024 at 6:20 AM Si-Wei Liu  wrote:
> >>
> >>
> >> On 8/29/2024 9:53 AM, Eugenio Perez Martin wrote:
> >>> On Wed, Aug 21, 2024 at 2:56 PM Jonah Palmer  
> >>> wrote:
> >>>> Decouples the IOVA allocator from the IOVA->HVA tree and instead adds
> >>>> the allocated IOVA range to an IOVA-only tree (iova_map). This IOVA tree
> >>>> will hold all IOVA ranges that have been allocated (e.g. in the
> >>>> IOVA->HVA tree) and are removed when any IOVA ranges are deallocated.
> >>>>
> >>>> A new API function vhost_iova_tree_insert() is also created to add an
> >>>> IOVA->HVA mapping into the IOVA->HVA tree.
> >>>>
> >>> I think this is a good first iteration but we can take steps to
> >>> simplify it. Also, it is great to be able to make points on real code
> >>> instead of designs on the air :).
> >>>
> >>> I expected a split of vhost_iova_tree_map_alloc between the current
> >>> vhost_iova_tree_map_alloc and vhost_iova_tree_map_alloc_gpa, or
> >>> similar. Similarly, a vhost_iova_tree_remove and
> >>> vhost_iova_tree_remove_gpa would be needed.
> >>>
> >>> The first one is used for regions that don't exist in the guest, like
> >>> SVQ vrings or CVQ buffers. The second one is the one used by the
> >>> memory listener to map the guest regions into the vdpa device.
> >>>
> >>> Implementation wise, only two trees are actually needed:
> >>> * Current iova_taddr_map that contains all IOVA->vaddr translations as
> >>> seen by the device, so both allocation functions can work on a single
> >>> tree. The function iova_tree_find_iova keeps using this one, so the
> >> I thought we had thorough discussion about this and agreed upon the
> >> decoupled IOVA allocator solution.
> > My interpretation of it is to leave the allocator as the current one,
> > and create a new tree with GPA which is guaranteed to be unique. But
> > we can talk over it of course.
> >
> >> But maybe I missed something earlier,
> >> I am not clear how come this iova_tree_find_iova function could still
> >> work with the full IOVA-> HVA tree when it comes to aliased memory or
> >> overlapped HVAs? Granted, for the memory map removal in the
> >> .region_del() path, we could rely on the GPA tree to locate the
> >> corresponding IOVA, but how come the translation path could figure out
> >> which IOVA range to return when the vaddr happens to fall in an
> >> overlapped HVA range?
> > That is not a problem, as they both translate to the same address at the 
> > device.
> Not sure I followed; it might return a wrong IOVA (range) that conflicts
> with the host kernel's map or has mismatched attributes (e.g. permission,
> size).
>

Let's leave out the permissions at the moment. I'm going to use the
example you found, but I'll reorder (1) and (3) insertions so it picks
the "wrong" IOVA range intentionally:

(1)
HVA: [0x7f7903ea, 0x7f7903ec)
GPA: [0xfeda, 0xfedc)
IOVA: [0x1000, 0x21000)

(2)
HVA: [0x7f7983e0, 0x7f9903e0)
GPA: [0x1, 0x208000)
IOVA: [0x80001000, 0x201000)

(3)
HVA: [0x7f7903e0, 0x7f7983e0)
GPA: [0x0, 0x8000)
IOVA: [0x201000, 0x208000)

Let's say that SVQ wants to translate the HVA range
0xfeda-0xfedd. So it makes available for the device two
chained buffers: One with addr=0x1000 len=0x2 and the other one
with addr=(0x2c1000 len=0x1).

The VirtIO device should be able to translate these two buffers in
isolation and chain them. Not optimal but it helps to keep QEMU source
clean, as the device already must support it. I don't foresee lots of
cases like this anyway :).
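[Editor's note: a toy model of the chained-buffer translation described above. The address values here are made up, since the archive truncated the originals; only the shape of the example matters: an HVA range that starts in an inner (aliased) region and runs past its end still translates, just as two chained (iova, len) buffers instead of one.]

```python
# (hva_start, hva_len, iova_start); the inner region is listed first on
# purpose so the lookup picks the "wrong" (inner) IOVA range first.
MAPPINGS = [
    (0x7f7903e40000, 0x20000, 0x1000),       # inner/aliased region
    (0x7f7983e00000, 0x180000, 0x80001000),  # large guest region
    (0x7f7903e00000, 0x80000, 0x201000),     # outer region containing the inner one
]

def translate(hva, length):
    """Translate an HVA range into a list of chained (iova, len) buffers,
    using the first mapping that contains the current start address."""
    out = []
    while length > 0:
        m = next((m for m in MAPPINGS if m[0] <= hva < m[0] + m[1]), None)
        if m is None:
            raise ValueError(f"unmapped HVA {hva:#x}")
        chunk = min(length, m[0] + m[1] - hva)   # clip at the region's end
        out.append((m[2] + (hva - m[0]), chunk))
        hva += chunk
        length -= chunk
    return out

# A 0x30000-byte buffer starting at the inner region: the first 0x20000
# bytes translate through it, the remainder continues via the outer region.
print([(hex(a), hex(l)) for a, l in translate(0x7f7903e40000, 0x30000)])
# -> [('0x1000', '0x20000'), ('0x261000', '0x10000')]
```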

About the permissions, maybe we can make the permissions to be part of
the lookup? Instead of returning them at iova_tree_find_iova, make
them match at iova_tree_find_address_iterator.

> >
> > The most complicated situation is where we have a region contained in
> > another region, and the requested buffer crosses them. If the IOVA
> > tree returns the inner region, it will return the buffer chained with
> > the rest of the content in the outer region. Not optimal, but solved
> > either way.
> Don't quite understand what it means... So in this overlapping case,
> speaking of the expectation of the translation API, you would like to
> have all IOVA 

Re: [RFC 1/2] vhost-vdpa: Decouple the IOVA allocator

2024-08-30 Thread Eugenio Perez Martin
On Fri, Aug 30, 2024 at 3:52 PM Jonah Palmer  wrote:
>
>
>
> On 8/30/24 4:05 AM, Eugenio Perez Martin wrote:
> > On Fri, Aug 30, 2024 at 6:20 AM Si-Wei Liu  wrote:
> >>
> >>
> >>
> >> On 8/29/2024 9:53 AM, Eugenio Perez Martin wrote:
> >>> On Wed, Aug 21, 2024 at 2:56 PM Jonah Palmer  
> >>> wrote:
> >>>> Decouples the IOVA allocator from the IOVA->HVA tree and instead adds
> >>>> the allocated IOVA range to an IOVA-only tree (iova_map). This IOVA tree
> >>>> will hold all IOVA ranges that have been allocated (e.g. in the
> >>>> IOVA->HVA tree) and are removed when any IOVA ranges are deallocated.
> >>>>
> >>>> A new API function vhost_iova_tree_insert() is also created to add an
> >>>> IOVA->HVA mapping into the IOVA->HVA tree.
> >>>>
> >>> I think this is a good first iteration but we can take steps to
> >>> simplify it. Also, it is great to be able to make points on real code
> >>> instead of designs on the air :).
> >>>
>
> I can add more comments in the code if this is what you mean, no problem!
>

No action needed about this feedback :). I just meant that it will be
easier to iterate on code than designing just by talking at this
stage.

> >>> I expected a split of vhost_iova_tree_map_alloc between the current
> >>> vhost_iova_tree_map_alloc and vhost_iova_tree_map_alloc_gpa, or
> >>> similar. Similarly, a vhost_iova_tree_remove and
> >>> vhost_iova_tree_remove_gpa would be needed.
> >>>
> >>> The first one is used for regions that don't exist in the guest, like
> >>> SVQ vrings or CVQ buffers. The second one is the one used by the
> >>> memory listener to map the guest regions into the vdpa device.
> >>>
> >>> Implementation wise, only two trees are actually needed:
> >>> * Current iova_taddr_map that contains all IOVA->vaddr translations as
> >>> seen by the device, so both allocation functions can work on a single
> >>> tree. The function iova_tree_find_iova keeps using this one, so the
> >> I thought we had thorough discussion about this and agreed upon the
> >> decoupled IOVA allocator solution.
> >
> > My interpretation of it is to leave the allocator as the current one,
> > and create a new tree with GPA which is guaranteed to be unique. But
> > we can talk over it of course.
> >
>
> So you mean keep the full IOVA->HVA tree but also have a GPA->IOVA tree
> as well for guest memory regions, correct?
>

Right.

> >> But maybe I missed something earlier,
> >> I am not clear how come this iova_tree_find_iova function could still
> >> work with the full IOVA-> HVA tree when it comes to aliased memory or
> >> overlapped HVAs? Granted, for the memory map removal in the
> >> .region_del() path, we could rely on the GPA tree to locate the
> >> corresponding IOVA, but how come the translation path could figure out
> >> which IOVA range to return when the vaddr happens to fall in an
> >> overlapped HVA range?
> >
> > That is not a problem, as they both translate to the same address at the 
> > device.
> >
> > The most complicated situation is where we have a region contained in
> > another region, and the requested buffer crosses them. If the IOVA
> > tree returns the inner region, it will return the buffer chained with
> > the rest of the content in the outer region. Not optimal, but solved
> > either way.
> >
> > The only problem that comes to my mind is the case where the inner
> > region is RO and it is a write command, but I don't think we have this
> > case in a sane guest. A malicious guest cannot do any harm this way
> > anyway.
> >
> >> Do we still assume some overlapping order so we
> >> always return the first match from the tree? Or we expect every current
> >> user of iova_tree_find_iova should pass in GPA rather than HVA and use
> >> the vhost_iova_xxx_gpa API variant to look up IOVA?
> >>
> >
> > No, iova_tree_find_iova should keep asking for vaddr, as the result is
> > guaranteed to be there. Users of VhostIOVATree only need to modify how
> > they add or remove regions, knowing if they come from the guest or
> > not. As shown by this series, it is easier to do in that place than in
> > translation.
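[Editor's note: a rough sketch, not QEMU code, of the two-tree arrangement described above. Names loosely mirror the ones discussed in the thread (map_alloc / map_alloc_gpa / remove_gpa), but the dict-based internals are purely illustrative: every translation keeps using the single IOVA->HVA tree, while the GPA tree is consulted only when guest regions are added or removed, so the .region_del() path can find the right IOVA by GPA even when HVAs alias.]

```python
class VhostIOVATreeSketch:
    def __init__(self):
        self.iova_taddr = {}   # iova -> (hva, size); used by all translations
        self.gpa_iova = {}     # gpa  -> iova; guest regions only
        self.next_iova = 0x1000

    def _alloc(self, hva, size):
        iova = self.next_iova
        self.next_iova += size
        self.iova_taddr[iova] = (hva, size)
        return iova

    def map_alloc(self, hva, size):
        """Host-only regions: SVQ vrings, CVQ buffers."""
        return self._alloc(hva, size)

    def map_alloc_gpa(self, hva, size, gpa):
        """Guest regions, added by the memory listener."""
        iova = self._alloc(hva, size)
        self.gpa_iova[gpa] = iova
        return iova

    def remove_gpa(self, gpa):
        """.region_del() path: keyed by GPA, which is guaranteed unique."""
        iova = self.gpa_iova.pop(gpa)
        del self.iova_taddr[iova]
```

Only how regions are added or removed depends on their origin; a lookup-by-vaddr helper would walk iova_taddr alone.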
> >
> >> Thanks,
> >> -Siwei
> >>
> >>> user does not need to know if the address is from the guest or only
>

Re: [RFC 2/2] vhost-vdpa: Implement GPA->IOVA & IOVA->SVQ HVA trees

2024-08-30 Thread Eugenio Perez Martin
On Fri, Aug 30, 2024 at 3:58 PM Jonah Palmer  wrote:
>
>
>
> On 8/29/24 12:55 PM, Eugenio Perez Martin wrote:
> > On Wed, Aug 21, 2024 at 2:56 PM Jonah Palmer  
> > wrote:
> >>
> >> Implements a GPA->IOVA and IOVA->SVQ HVA tree for handling mapping,
> >> unmapping, and translations for guest and host-only memory,
> >> respectively.
> >>
> >> By splitting up a full IOVA->HVA tree (containing both guest and
> >> host-only memory mappings) into a GPA->IOVA tree (containing only guest
> >> memory mappings) and a IOVA->SVQ HVA tree (containing host-only memory
> >> mappings), we can avoid translating to the wrong IOVA when the guest has
> >> overlapping memory regions where different GPAs lead to the same HVA.
> >>
> >> In other words, if the guest has overlapping memory regions, translating
> >> an HVA to an IOVA may result in receiving an incorrect IOVA when
> >> searching the full IOVA->HVA tree. This would be due to one HVA range
> >> being contained (overlapping) in another HVA range in the IOVA->HVA
> >> tree.
> >>
> >> To avoid this issue, creating a GPA->IOVA tree and using it to translate
> >> a GPA to an IOVA ensures that the IOVA we receive is the correct one
> >> (instead of relying on a HVA->IOVA translation).
> >>
> >> As a byproduct of creating a GPA->IOVA tree, the full IOVA->HVA tree now
> >> becomes a partial IOVA->SVQ HVA tree. That is, since we're moving all
> >> guest memory mappings to the GPA->IOVA tree, the host-only memory
> >> mappings are now the only mappings being put into the IOVA->HVA tree.
> >>
> >> Furthermore, as an additional byproduct of splitting up guest and
> >> host-only memory mappings into separate trees, special attention needs
> >> to be paid to vhost_svq_translate_addr() when translating memory buffers
> >> from iovec. The memory buffers from iovec can be backed by guest memory
> >> or host-only memory, which means that we need to figure out who is
> >> backing these buffers and then decide which tree to use for translating
> >> it.
> >>
> >> In this patch we determine the backer of this buffer by first checking
> >> if a RAM block can be inferred from the buffer's HVA. That is, we use
> >> qemu_ram_block_from_host() and if a valid RAM block is returned, we know
> >> the buffer's HVA is backed by guest memory. Then we derive the GPA from
> >> it and translate the GPA to an IOVA using the GPA->IOVA tree.
> >>
> >> If an invalid RAM block is returned, the buffer's HVA is likely backed
> >> by host-only memory. In this case, we can then simply translate the HVA
> >> to an IOVA using the partial IOVA->SVQ HVA tree.
> >>
> >> However, this method is sub-optimal, especially for memory buffers
> >> backed by host-only memory, due to needing to iterate over some (if not
> >> all) RAMBlock structures and then searching either the GPA->IOVA tree or
> >> the IOVA->SVQ HVA tree. Optimizations to improve performance in this
> >> area should be revisited at some point.
> >>
> >> Signed-off-by: Jonah Palmer 
> >> ---
> >>   hw/virtio/vhost-iova-tree.c| 53 +-
> >>   hw/virtio/vhost-iova-tree.h|  5 ++-
> >>   hw/virtio/vhost-shadow-virtqueue.c | 48 +++
> >>   hw/virtio/vhost-vdpa.c | 18 +-
> >>   include/qemu/iova-tree.h   | 22 +
> >>   util/iova-tree.c   | 46 ++
> >>   6 files changed, 173 insertions(+), 19 deletions(-)
> >>
> >> diff --git a/hw/virtio/vhost-iova-tree.c b/hw/virtio/vhost-iova-tree.c
> >> index 32c03db2f5..5a3f6b5cd9 100644
> >> --- a/hw/virtio/vhost-iova-tree.c
> >> +++ b/hw/virtio/vhost-iova-tree.c
> >> @@ -26,15 +26,19 @@ struct VhostIOVATree {
> >>   /* Last addressable iova address in the device */
> >>   uint64_t iova_last;
> >>
> >> -/* IOVA address to qemu memory maps. */
> >> +/* IOVA address to qemu SVQ memory maps. */
> >>   IOVATree *iova_taddr_map;
> >>
> >>   /* IOVA tree (IOVA allocator) */
> >>   IOVATree *iova_map;
> >> +
> >> +/* GPA->IOVA tree */
> >> +IOVATree *gpa_map;
> >>   };
> >>
> >>   /**
> >>* Create a new VhostIOVATree with a new set of 

Re: [RFC v3 3/3] vhost: Allocate memory for packed vring

2024-08-30 Thread Eugenio Perez Martin
On Fri, Aug 30, 2024 at 12:20 PM Sahil  wrote:
>
> Hi,
>
> On Tuesday, August 27, 2024 9:00:36 PM GMT+5:30 Eugenio Perez Martin wrote:
> > On Wed, Aug 21, 2024 at 2:20 PM Sahil  wrote:
> > > [...]
> > > I have been trying to test my changes so far as well. I am not very clear
> > > on a few things.
> > >
> > > Q1.
> > > I built QEMU from source with my changes and followed the vdpa_sim +
> > > vhost_vdpa tutorial [1]. The VM seems to be running fine. How do I check
> > > if the packed format is being used instead of the split vq format for
> > > shadow virtqueues? I know the packed format is used when virtio_vdev has
> > > got the VIRTIO_F_RING_PACKED bit enabled. Is there a way of checking that
> > > this is the case?
> >
> > You can see the features that the driver acked from the guest by
> > checking sysfs. Once you know the PCI BDF from lspci:
> > # lspci -nn|grep '\[1af4:1041\]'
> > 01:00.0 Ethernet controller [0200]: Red Hat, Inc. Virtio 1.0 network
> > device [1af4:1041] (rev 01)
> > # cut -c 35
> > /sys/devices/pci:00/:00:02.0/:01:00.0/virtio0/features 0
> >
> > Also, you can check from QEMU by simply tracing if your functions are
> > being called.
> >
> > > Q2.
> > > What's the recommended way to see what's going on under the hood? I tried
> > > using the -D option so QEMU's logs are written to a file but the file was
> > > empty. Would using qemu with -monitor stdio or attaching gdb to the QEMU
> > > VM be worthwhile?
> >
> > You need to add --trace options with the regex you want to get to
> > enable any output. For example, --trace 'vhost_vdpa_*' print all the
> > trace_vhost_vdpa_* functions.
> >
> > If you want to speed things up, you can just replace the interesting
> > trace_... functions with fprintf(stderr, ...). We can add the trace
> > ones afterwards.
>
> Understood. I am able to trace the functions that are being called with
> fprintf. I'll stick with fprintf for now.
>
> I realized that packed vqs are not being used in the test environment. I
> see that in "hw/virtio/vhost-shadow-virtqueue.c", svq->is_packed is set
> to 0 and that calls vhost_svq_add_split(). I am not sure how one enables
> the packed feature bit. I don't know if this is an environment issue.
>
> I built qemu from the latest source with my changes on top of it. I followed
> this article [1] to set up the environment.
>
> On the host machine:
>
> $ uname -a
> Linux fedora 6.10.5-100.fc39.x86_64 #1 SMP PREEMPT_DYNAMIC Wed Aug 14 
> 15:49:25 UTC 2024 x86_64 GNU/Linux
>
> $ ./qemu/build/qemu-system-x86_64 --version
> QEMU emulator version 9.0.91
>
> $ vdpa -V
> vdpa utility, iproute2-6.4.0
>
> All the relevant vdpa modules have been loaded in accordance with [1].
>
> $ lsmod | grep -iE "(vdpa|virtio)"
> vdpa_sim_net           12288  0
> vdpa_sim               24576  1 vdpa_sim_net
> vringh                 32768  2 vdpa_sim,vdpa_sim_net
> vhost_vdpa             32768  2
> vhost                  65536  1 vhost_vdpa
> vhost_iotlb            16384  4 vdpa_sim,vringh,vhost_vdpa,vhost
> vdpa                   36864  3 vdpa_sim,vhost_vdpa,vdpa_sim_net
>
> $ ls -l /sys/bus/vdpa/devices/vdpa0/driver
> lrwxrwxrwx. 1 root root 0 Aug 30 11:25 /sys/bus/vdpa/devices/vdpa0/driver -> 
> ../../bus/vdpa/drivers/vhost_vdpa
>
> In the output of the following command, I see ANY_LAYOUT is supported.
> According to virtio_config.h [2] in the linux kernel, this represents the
> layout of descriptors. This refers to split and packed vqs, right?
>
> $ vdpa mgmtdev show
> vdpasim_net:
>   supported_classes net
>   max_supported_vqs 3
>   dev_features MTU MAC STATUS CTRL_VQ CTRL_MAC_ADDR ANY_LAYOUT VERSION_1 
> ACCESS_PLATFORM
>
> $ vdpa dev show -jp
> {
> "dev": {
> "vdpa0": {
> "type": "network",
> "mgmtdev": "vdpasim_net",
> "vendor_id": 0,
> "max_vqs": 3,
> "max_vq_size": 256
> }
> }
> }
>
> I started the VM by running:
>
> $ sudo ./qemu/build/qemu-system-x86_64 \
> -enable-kvm \
> -drive file=//home/ig91/fedora_qemu_test_vm/L1.qcow2,media=disk,if=virtio \
> -net nic,model=virtio \
> -net user,hostfwd=tcp::2226-:22 \
> -netdev type=vhost-vdpa,vhostdev=/dev/vhost-vdpa-0,id=vhost-vdpa0 \
> -device 
> virtio-net-pci,netdev=vhost-vdpa0,bus=pci.0,addr=0x7,disable-legacy=on,disable-modern=off,page-per-vq=on,event_idx=off,packed=o

Re: [RFC 1/2] vhost-vdpa: Decouple the IOVA allocator

2024-08-30 Thread Eugenio Perez Martin
On Fri, Aug 30, 2024 at 6:20 AM Si-Wei Liu  wrote:
>
>
>
> On 8/29/2024 9:53 AM, Eugenio Perez Martin wrote:
> > On Wed, Aug 21, 2024 at 2:56 PM Jonah Palmer  
> > wrote:
> >> Decouples the IOVA allocator from the IOVA->HVA tree and instead adds
> >> the allocated IOVA range to an IOVA-only tree (iova_map). This IOVA tree
> >> will hold all IOVA ranges that have been allocated (e.g. in the
> >> IOVA->HVA tree) and are removed when any IOVA ranges are deallocated.
> >>
> >> A new API function vhost_iova_tree_insert() is also created to add an
> >> IOVA->HVA mapping into the IOVA->HVA tree.
> >>
> > I think this is a good first iteration but we can take steps to
> > simplify it. Also, it is great to be able to make points on real code
> > instead of designs on the air :).
> >
> > I expected a split of vhost_iova_tree_map_alloc between the current
> > vhost_iova_tree_map_alloc and vhost_iova_tree_map_alloc_gpa, or
> > similar. Similarly, a vhost_iova_tree_remove and
> > vhost_iova_tree_remove_gpa would be needed.
> >
> > The first one is used for regions that don't exist in the guest, like
> > SVQ vrings or CVQ buffers. The second one is the one used by the
> > memory listener to map the guest regions into the vdpa device.
> >
> > Implementation wise, only two trees are actually needed:
> > * Current iova_taddr_map that contains all IOVA->vaddr translations as
> > seen by the device, so both allocation functions can work on a single
> > tree. The function iova_tree_find_iova keeps using this one, so the
> I thought we had a thorough discussion about this and agreed upon the
> decoupled IOVA allocator solution.

My interpretation of it is to leave the allocator as the current one,
and create a new tree keyed by GPA, which is guaranteed to be unique. But
we can talk it over, of course.

> But maybe I missed something earlier,
> I am not clear how come this iova_tree_find_iova function could still
> work with the full IOVA-> HVA tree when it comes to aliased memory or
> overlapped HVAs? Granted, for the memory map removal in the
> .region_del() path, we could rely on the GPA tree to locate the
> corresponding IOVA, but how come the translation path could figure out
> which IOVA range to return when the vaddr happens to fall in an
> overlapped HVA range?

That is not a problem, as they both translate to the same address at the device.

The most complicated situation is where we have a region contained in
another region, and the requested buffer crosses them. If the IOVA
tree returns the inner region, it will return the buffer chained with
the rest of the content in the outer region. Not optimal, but solved
either way.

The only problem that comes to my mind is the case where the inner
region is RO and it is a write command, but I don't think we have this
case in a sane guest. A malicious guest cannot do any harm this way
anyway.

> Do we still assume some overlapping order so we
> always return the first match from the tree? Or we expect every current
> user of iova_tree_find_iova should pass in GPA rather than HVA and use
> the vhost_iova_xxx_gpa API variant to look up IOVA?
>

No, iova_tree_find_iova should keep asking for vaddr, as the result is
guaranteed to be there. Users of VhostIOVATree only need to modify how
they add or remove regions, knowing if they come from the guest or
not. As shown by this series, it is easier to do in that place than in
translation.
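As a minimal illustration of that point, here is a standalone C sketch. A plain array and linear search stand in for QEMU's GTree-based tree, so `DMAMap` and `find_iova` here are simplified stand-ins, not the real `iova_tree_find_iova()`. A vaddr lookup returns whichever map covers the HVA, regardless of whether the entry was added for guest memory or for SVQ-only memory:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t hwaddr;

/* Stand-in for QEMU's DMAMap: iova is the key, translated_addr the HVA. */
typedef struct DMAMap {
    hwaddr iova;
    hwaddr translated_addr;
    hwaddr size;   /* inclusive limit, mirroring util/iova-tree.c */
} DMAMap;

/* Linear stand-in for iova_tree_find_iova(): match by HVA range only. */
static const DMAMap *find_iova(const DMAMap *maps, size_t n, hwaddr vaddr)
{
    for (size_t i = 0; i < n; i++) {
        if (vaddr >= maps[i].translated_addr &&
            vaddr - maps[i].translated_addr <= maps[i].size) {
            return &maps[i];
        }
    }
    return NULL;
}
```

Only the add/remove side needs to know where a mapping came from; the lookup by vaddr is origin-agnostic.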

> Thanks,
> -Siwei
>
> > user does not need to know if the address is from the guest or only
> > exists in QEMU by using RAMBlock etc. All insert and remove functions
> > use this tree.
> > * A new tree that relates IOVA to GPA, that only
> > vhost_iova_tree_map_alloc_gpa and vhost_iova_tree_remove_gpa uses.
> >
> > The ideal case is that the key in this new tree is the GPA and the
> > value is the IOVA. But IOVATree's DMAMap is named the reverse: iova is
> > the key and translated_addr is the vaddr. We can create a new tree
> > struct for that, use GTree directly, or translate the reverse
> > linearly. As memory add / remove should not be frequent, I think the
> > simplest is the last one, but I'd be ok with creating a new tree.
> >
> > vhost_iova_tree_map_alloc_gpa needs to add the map to this new tree
> > also. Similarly, vhost_iova_tree_remove_gpa must look for the GPA in
> > this tree, and only remove the associated DMAMap in iova_taddr_map
> > that matches the IOVA.
> >
> > Does it make sense to you?
> >
> >> Signed-off-by: Jonah Palmer 
> >> ---
> >>   hw/virtio/vhost-iova-tree.c | 38 -

Re: [RFC 2/2] vhost-vdpa: Implement GPA->IOVA & IOVA->SVQ HVA trees

2024-08-29 Thread Eugenio Perez Martin
On Wed, Aug 21, 2024 at 2:56 PM Jonah Palmer  wrote:
>
> Implements a GPA->IOVA and IOVA->SVQ HVA tree for handling mapping,
> unmapping, and translations for guest and host-only memory,
> respectively.
>
> By splitting up a full IOVA->HVA tree (containing both guest and
> host-only memory mappings) into a GPA->IOVA tree (containing only guest
> memory mappings) and an IOVA->SVQ HVA tree (containing host-only memory
> mappings), we can avoid translating to the wrong IOVA when the guest has
> overlapping memory regions where different GPAs lead to the same HVA.
>
> In other words, if the guest has overlapping memory regions, translating
> an HVA to an IOVA may result in receiving an incorrect IOVA when
> searching the full IOVA->HVA tree. This would be due to one HVA range
> being contained (overlapping) in another HVA range in the IOVA->HVA
> tree.
>
> To avoid this issue, creating a GPA->IOVA tree and using it to translate
> a GPA to an IOVA ensures that the IOVA we receive is the correct one
> (instead of relying on a HVA->IOVA translation).
>
> As a byproduct of creating a GPA->IOVA tree, the full IOVA->HVA tree now
> becomes a partial IOVA->SVQ HVA tree. That is, since we're moving all
> guest memory mappings to the GPA->IOVA tree, the host-only memory
> mappings are now the only mappings being put into the IOVA->HVA tree.
>
> Furthermore, as an additional byproduct of splitting up guest and
> host-only memory mappings into separate trees, special attention needs
> to be paid to vhost_svq_translate_addr() when translating memory buffers
> from iovec. The memory buffers from iovec can be backed by guest memory
> or host-only memory, which means that we need to figure out who is
> backing these buffers and then decide which tree to use for translating
> it.
>
> In this patch we determine the backer of this buffer by first checking
> if a RAM block can be inferred from the buffer's HVA. That is, we use
> qemu_ram_block_from_host() and if a valid RAM block is returned, we know
> the buffer's HVA is backed by guest memory. Then we derive the GPA from
> it and translate the GPA to an IOVA using the GPA->IOVA tree.
>
> If an invalid RAM block is returned, the buffer's HVA is likely backed
> by host-only memory. In this case, we can then simply translate the HVA
> to an IOVA using the partial IOVA->SVQ HVA tree.
>
> However, this method is sub-optimal, especially for memory buffers
> backed by host-only memory, due to needing to iterate over some (if not
> all) RAMBlock structures and then searching either the GPA->IOVA tree or
> the IOVA->SVQ HVA tree. Optimizations to improve performance in this
> area should be revisited at some point.
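The decision described in the commit message can be sketched in isolation. In this standalone C snippet, `FakeRAMBlock` and `block_from_host()` are invented stand-ins for QEMU's RAMBlock and `qemu_ram_block_from_host()`, and all addresses are made up for illustration:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t hwaddr;

/* Stand-in for a RAMBlock covering one guest memory region. */
typedef struct {
    uint64_t host;   /* HVA where the region is mapped inside QEMU */
    hwaddr gpa;      /* guest-physical base of the region */
    size_t len;
} FakeRAMBlock;

static const FakeRAMBlock blocks[] = {
    { .host = 0x7f0000000000ULL, .gpa = 0x40000000, .len = 0x100000 },
};

/*
 * Stand-in for qemu_ram_block_from_host(): a hit means the HVA is
 * guest-backed (translate via the GPA->IOVA tree); NULL means host-only
 * memory (translate via the IOVA->SVQ HVA tree).
 */
static const FakeRAMBlock *block_from_host(uint64_t hva, hwaddr *gpa)
{
    for (size_t i = 0; i < sizeof(blocks) / sizeof(blocks[0]); i++) {
        if (hva >= blocks[i].host && hva - blocks[i].host < blocks[i].len) {
            *gpa = blocks[i].gpa + (hva - blocks[i].host);
            return &blocks[i];
        }
    }
    return NULL;
}
```

The real lookup walks RAMBlock structures the same way, which is the per-buffer cost the commit message flags as sub-optimal.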
>
> Signed-off-by: Jonah Palmer 
> ---
>  hw/virtio/vhost-iova-tree.c| 53 +-
>  hw/virtio/vhost-iova-tree.h|  5 ++-
>  hw/virtio/vhost-shadow-virtqueue.c | 48 +++
>  hw/virtio/vhost-vdpa.c | 18 +-
>  include/qemu/iova-tree.h   | 22 +
>  util/iova-tree.c   | 46 ++
>  6 files changed, 173 insertions(+), 19 deletions(-)
>
> diff --git a/hw/virtio/vhost-iova-tree.c b/hw/virtio/vhost-iova-tree.c
> index 32c03db2f5..5a3f6b5cd9 100644
> --- a/hw/virtio/vhost-iova-tree.c
> +++ b/hw/virtio/vhost-iova-tree.c
> @@ -26,15 +26,19 @@ struct VhostIOVATree {
>  /* Last addressable iova address in the device */
>  uint64_t iova_last;
>
> -/* IOVA address to qemu memory maps. */
> +/* IOVA address to qemu SVQ memory maps. */
>  IOVATree *iova_taddr_map;
>
>  /* IOVA tree (IOVA allocator) */
>  IOVATree *iova_map;
> +
> +/* GPA->IOVA tree */
> +IOVATree *gpa_map;
>  };
>
>  /**
>   * Create a new VhostIOVATree with a new set of IOVATree's:
> + * - GPA->IOVA tree (gpa_map)
>   * - IOVA allocator (iova_map)
>   * - IOVA->HVA tree (iova_taddr_map)
>   *
> @@ -50,6 +54,7 @@ VhostIOVATree *vhost_iova_tree_new(hwaddr iova_first, 
> hwaddr iova_last)
>
>  tree->iova_taddr_map = iova_tree_new();
>  tree->iova_map = iova_tree_new();
> +tree->gpa_map = gpa_tree_new();
>  return tree;
>  }
>
> @@ -136,3 +141,49 @@ int vhost_iova_tree_insert(VhostIOVATree *iova_tree, 
> DMAMap *map)
>
>  return iova_tree_insert(iova_tree->iova_taddr_map, map);
>  }
> +
> +/**
> + * Insert a new GPA->IOVA mapping to the GPA->IOVA tree
> + *
> + * @iova_tree: The VhostIOVATree
> + * @map: The GPA->IOVA mapping
> + *
> + * Returns:
> + * - IOVA_OK if the map fits in the container
> + * - IOVA_ERR_INVALID if the map does not make sense (like size overflow)
> + * - IOVA_ERR_OVERLAP if the GPA range overlaps with an existing range
> + */
> +int vhost_gpa_tree_insert(VhostIOVATree *iova_tree, DMAMap *map)
> +{
> +if (map->iova + map->size < map->iova || map->perm == IOMMU_NONE) {
> +return IOVA_ERR_INVALID;
> +}
> +
> +return gpa_tree_insert(iova_tree->gpa_map, map);
> +}
> +
> +/**
> + * Find the IOVA address 

Re: [RFC 1/2] vhost-vdpa: Decouple the IOVA allocator

2024-08-29 Thread Eugenio Perez Martin
On Wed, Aug 21, 2024 at 2:56 PM Jonah Palmer  wrote:
>
> Decouples the IOVA allocator from the IOVA->HVA tree and instead adds
> the allocated IOVA range to an IOVA-only tree (iova_map). This IOVA tree
> will hold all IOVA ranges that have been allocated (e.g. in the
> IOVA->HVA tree) and are removed when any IOVA ranges are deallocated.
>
> A new API function vhost_iova_tree_insert() is also created to add an
> IOVA->HVA mapping into the IOVA->HVA tree.
>

I think this is a good first iteration but we can take steps to
simplify it. Also, it is great to be able to make points on real code
instead of designs on the air :).

I expected a split of vhost_iova_tree_map_alloc between the current
vhost_iova_tree_map_alloc and vhost_iova_tree_map_alloc_gpa, or
similar. Similarly, a vhost_iova_tree_remove and
vhost_iova_tree_remove_gpa would be needed.

The first one is used for regions that don't exist in the guest, like
SVQ vrings or CVQ buffers. The second one is the one used by the
memory listener to map the guest regions into the vdpa device.

Implementation wise, only two trees are actually needed:
* Current iova_taddr_map that contains all IOVA->vaddr translations as
seen by the device, so both allocation functions can work on a single
tree. The function iova_tree_find_iova keeps using this one, so the
user does not need to know if the address is from the guest or only
exists in QEMU by using RAMBlock etc. All insert and remove functions
use this tree.
* A new tree that relates IOVA to GPA, that only
vhost_iova_tree_map_alloc_gpa and vhost_iova_tree_remove_gpa uses.

The ideal case is that the key in this new tree is the GPA and the
value is the IOVA. But IOVATree's DMAMap is named the reverse: iova is
the key and translated_addr is the vaddr. We can create a new tree
struct for that, use GTree directly, or translate the reverse
linearly. As memory add / remove should not be frequent, I think the
simplest is the last one, but I'd be ok with creating a new tree.

vhost_iova_tree_map_alloc_gpa needs to add the map to this new tree
also. Similarly, vhost_iova_tree_remove_gpa must look for the GPA in
this tree, and only remove the associated DMAMap in iova_taddr_map
that matches the IOVA.

Does it make sense to you?
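To make the "translate the reverse linearly" option above concrete, here is a standalone C sketch. A fixed-size array stands in for the new GPA tree, and `gpa_map_insert()`/`gpa_map_find_iova()` are invented names for illustration, not a proposed API:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t hwaddr;

/* Stand-in for QEMU's DMAMap; here translated_addr holds the GPA. */
typedef struct DMAMap {
    hwaddr iova;
    hwaddr translated_addr;
    hwaddr size;
} DMAMap;

#define N_GPA_MAPS 8
static DMAMap gpa_maps[N_GPA_MAPS];
static size_t n_gpa_maps;

static void gpa_map_insert(hwaddr gpa, hwaddr iova, hwaddr size)
{
    assert(n_gpa_maps < N_GPA_MAPS);
    gpa_maps[n_gpa_maps++] = (DMAMap){
        .iova = iova, .translated_addr = gpa, .size = size,
    };
}

/* Linear reverse lookup: resolve a GPA to the IOVA it was mapped at. */
static bool gpa_map_find_iova(hwaddr gpa, hwaddr *iova)
{
    for (size_t i = 0; i < n_gpa_maps; i++) {
        const DMAMap *m = &gpa_maps[i];
        if (gpa >= m->translated_addr && gpa - m->translated_addr <= m->size) {
            *iova = m->iova + (gpa - m->translated_addr);
            return true;
        }
    }
    return false;
}
```

A vhost_iova_tree_remove_gpa() along these lines would first resolve the GPA to its IOVA and then remove only the matching DMAMap from iova_taddr_map.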

> Signed-off-by: Jonah Palmer 
> ---
>  hw/virtio/vhost-iova-tree.c | 38 -
>  hw/virtio/vhost-iova-tree.h |  1 +
>  hw/virtio/vhost-vdpa.c  | 31 --
>  net/vhost-vdpa.c| 13 +++--
>  4 files changed, 70 insertions(+), 13 deletions(-)
>
> diff --git a/hw/virtio/vhost-iova-tree.c b/hw/virtio/vhost-iova-tree.c
> index 3d03395a77..32c03db2f5 100644
> --- a/hw/virtio/vhost-iova-tree.c
> +++ b/hw/virtio/vhost-iova-tree.c
> @@ -28,12 +28,17 @@ struct VhostIOVATree {
>
>  /* IOVA address to qemu memory maps. */
>  IOVATree *iova_taddr_map;
> +
> +/* IOVA tree (IOVA allocator) */
> +IOVATree *iova_map;
>  };
>
>  /**
> - * Create a new IOVA tree
> + * Create a new VhostIOVATree with a new set of IOVATree's:

s/IOVA tree/VhostIOVATree/ is good, but I think the rest is more an
implementation detail.

> + * - IOVA allocator (iova_map)
> + * - IOVA->HVA tree (iova_taddr_map)
>   *
> - * Returns the new IOVA tree
> + * Returns the new VhostIOVATree
>   */
>  VhostIOVATree *vhost_iova_tree_new(hwaddr iova_first, hwaddr iova_last)
>  {
> @@ -44,6 +49,7 @@ VhostIOVATree *vhost_iova_tree_new(hwaddr iova_first, 
> hwaddr iova_last)
>  tree->iova_last = iova_last;
>
>  tree->iova_taddr_map = iova_tree_new();
> +tree->iova_map = iova_tree_new();
>  return tree;
>  }
>
> @@ -53,6 +59,7 @@ VhostIOVATree *vhost_iova_tree_new(hwaddr iova_first, 
> hwaddr iova_last)
>  void vhost_iova_tree_delete(VhostIOVATree *iova_tree)
>  {
>  iova_tree_destroy(iova_tree->iova_taddr_map);
> +iova_tree_destroy(iova_tree->iova_map);
>  g_free(iova_tree);
>  }
>
> @@ -88,13 +95,12 @@ int vhost_iova_tree_map_alloc(VhostIOVATree *tree, DMAMap 
> *map)
>  /* Some vhost devices do not like addr 0. Skip first page */
>  hwaddr iova_first = tree->iova_first ?: qemu_real_host_page_size();
>
> -if (map->translated_addr + map->size < map->translated_addr ||

Why remove this condition? If the request is invalid we still need to
return an error here.

Maybe we should move it to iova_tree_alloc_map though.

> -map->perm == IOMMU_NONE) {
> +if (map->perm == IOMMU_NONE) {
>  return IOVA_ERR_INVALID;
>  }
>
>  /* Allocate a node in IOVA address */
> -return iova_tree_alloc_map(tree->iova_taddr_map, map, iova_first,
> +return iova_tree_alloc_map(tree->iova_map, map, iova_first,
> tree->iova_last);
>  }
>
> @@ -107,4 +113,26 @@ int vhost_iova_tree_map_alloc(VhostIOVATree *tree, 
> DMAMap *map)
>  void vhost_iova_tree_remove(VhostIOVATree *iova_tree, DMAMap map)
>  {
>  iova_tree_remove(iova_tree->iova_taddr_map, map);
> +iova_tree_remove(iova_tree->

Re: [RFC v3 3/3] vhost: Allocate memory for packed vring

2024-08-27 Thread Eugenio Perez Martin
On Wed, Aug 21, 2024 at 2:20 PM Sahil  wrote:
>
> Hi,
>
> Sorry for the late reply.
>
> On Tuesday, August 13, 2024 12:23:55 PM GMT+5:30 Eugenio Perez Martin wrote:
> > [...]
> > > I think I have understood what's going on in "vhost_vdpa_svq_map_rings",
> > > "vhost_vdpa_svq_map_ring" and "vhost_vdpa_dma_map". But based on
> > > what I have understood it looks like the driver area is getting mapped to
> > > an iova which is read-only for vhost_vdpa. Please let me know where I am
> > > going wrong.
> >
> > You're not going wrong there. The device does not need to write into
> > this area, so we map it read only.
> >
> > > Consider the following implementation in hw/virtio/vhost_vdpa.c:
> > > > size_t device_size = vhost_svq_device_area_size(svq);
> > > > size_t driver_size = vhost_svq_driver_area_size(svq);
> > >
> > > The driver size includes the descriptor area and the driver area. For
> > > packed vq, the driver area is the "driver event suppression" structure
> > > which should be read-only for the device according to the virtio spec
> > > (section 2.8.10) [1].
> > >
> > > > size_t avail_offset;
> > > > bool ok;
> > > >
> > > > vhost_svq_get_vring_addr(svq, &svq_addr);
> > >
> > > Over here "svq_addr.desc_user_addr" will point to the descriptor area
> > > while "svq_addr.avail_user_addr" will point to the driver area/driver
> > > event suppression structure.
> > >
> > > > driver_region = (DMAMap) {
> > > >
> > > > .translated_addr = svq_addr.desc_user_addr,
> > > > .size = driver_size - 1,
> > > > .perm = IOMMU_RO,
> > > >
> > > > };
> > >
> > > This region points to the descriptor area and its size encompasses the
> > > driver area as well with RO permission.
> > >
> > > > ok = vhost_vdpa_svq_map_ring(v, &driver_region, errp);
> > >
> > > The above function checks the value of needle->perm and sees that it is
> > > RO.
> > >
> > > It then calls "vhost_vdpa_dma_map" with the following arguments:
> > > > r = vhost_vdpa_dma_map(v->shared, v->address_space_id, needle->iova,
> > > >
> > > >needle->size + 1,
> > > >(void
> > > >*)(uintptr_t)needle->tra
> > > >nslated_addr,
> > > >needle->perm ==
> > > >IOMMU_RO);
> > >
> > > Since needle->size includes the driver area as well, the driver area will
> > > be mapped to a RO page in the device's address space, right?
> >
> > Yes, the device does not need to write into the descriptor area in the
> > supported split virtqueue case. So the descriptor area is also mapped
> > RO at this moment.
> >
> > This changes in the packed virtqueue case, so we need to map it RW.
>
> I understand this now. I'll see how the implementation can be modified to take
> this into account. I'll see if making the driver area and descriptor ring 
> helps.
>
> > > > if (unlikely(!ok)) {
> > > >
> > > > error_prepend(errp, "Cannot create vq driver region: ");
> > > > return false;
> > > >
> > > > }
> > > > addr->desc_user_addr = driver_region.iova;
> > > > avail_offset = svq_addr.avail_user_addr - svq_addr.desc_user_addr;
> > > > addr->avail_user_addr = driver_region.iova + avail_offset;
> > >
> > > I think "addr->desc_user_addr" and "addr->avail_user_addr" will both be
> > > mapped to a RO page in the device's address space.
> > >
> > > > device_region = (DMAMap) {
> > > >
> > > > .translated_addr = svq_addr.used_user_addr,
> > > > .size = device_size - 1,
> > > > .perm = IOMMU_RW,
> > > >
> > > > };
> > >
> > > The device area/device event suppression structure on the other hand will
> > > be mapped to a RW page.
> > >
> > > I also think there are other issues with the current state of the patch.
> > > Accordin

Re: [RFC v3 3/3] vhost: Allocate memory for packed vring

2024-08-12 Thread Eugenio Perez Martin
On Mon, Aug 12, 2024 at 9:32 PM Sahil  wrote:
>
> Hi,
>
> On Monday, August 12, 2024 12:01:00 PM GMT+5:30 you wrote:
> > On Sun, Aug 11, 2024 at 7:20 PM Sahil  wrote:
> > > On Wednesday, August 7, 2024 9:52:10 PM GMT+5:30 Eugenio Perez Martin 
> > > wrote:
> > > > On Fri, Aug 2, 2024 at 1:22 PM Sahil Siddiq  
> > > > wrote:
> > > > > [...]
> > > > > @@ -726,17 +738,30 @@ void vhost_svq_start(VhostShadowVirtqueue *svq, 
> > > > > VirtIODevice *vdev,
> > > > >  svq->vring.num = virtio_queue_get_num(vdev,
> > > > >  virtio_get_queue_index(vq));
> > > > >  svq->num_free = svq->vring.num;
> > > > >
> > > > > -svq->vring.desc = mmap(NULL, vhost_svq_driver_area_size(svq),
> > > > > -   PROT_READ | PROT_WRITE, MAP_SHARED | 
> > > > > MAP_ANONYMOUS,
> > > > > -   -1, 0);
> > > > > -desc_size = sizeof(vring_desc_t) * svq->vring.num;
> > > > > -svq->vring.avail = (void *)((char *)svq->vring.desc + desc_size);
> > > > > -svq->vring.used = mmap(NULL, vhost_svq_device_area_size(svq),
> > > > > -   PROT_READ | PROT_WRITE, MAP_SHARED | 
> > > > > MAP_ANONYMOUS,
> > > > > -   -1, 0);
> > > > > -svq->desc_state = g_new0(SVQDescState, svq->vring.num);
> > > > > -svq->desc_next = g_new0(uint16_t, svq->vring.num);
> > > > > -for (unsigned i = 0; i < svq->vring.num - 1; i++) {
> > > > > +svq->is_packed = virtio_vdev_has_feature(svq->vdev, 
> > > > > VIRTIO_F_RING_PACKED);
> > > > > +
> > > > > +if (virtio_vdev_has_feature(svq->vdev, VIRTIO_F_RING_PACKED)) {
> > > > > +svq->vring_packed.vring.desc = mmap(NULL, 
> > > > > vhost_svq_memory_packed(svq),
> > > > > +  PROT_READ | PROT_WRITE, 
> > > > > MAP_SHARED | MAP_ANONYMOUS,
> > > > > +  -1, 0);
> > > > > +desc_size = sizeof(struct vring_packed_desc) * 
> > > > > svq->vring.num;
> > > > > +svq->vring_packed.vring.driver = (void *)((char 
> > > > > *)svq->vring_packed.vring.desc + desc_size);
> > > > > +svq->vring_packed.vring.device = (void *)((char 
> > > > > *)svq->vring_packed.vring.driver +
> > > > > + sizeof(struct 
> > > > > vring_packed_desc_event));
> > > >
> > > > This is a great start but it will be problematic when you start
> > > > mapping the areas to the vdpa device. The driver area should be read
> > > > only for the device, but it is placed in the same page as a RW one.
> > > >
> > > > More on this later.
> > > >
> > > > > +} else {
> > > > > +svq->vring.desc = mmap(NULL, vhost_svq_driver_area_size(svq),
> > > > > +   PROT_READ | PROT_WRITE, MAP_SHARED 
> > > > > |MAP_ANONYMOUS,
> > > > > +   -1, 0);
> > > > > +desc_size = sizeof(vring_desc_t) * svq->vring.num;
> > > > > +svq->vring.avail = (void *)((char *)svq->vring.desc + 
> > > > > desc_size);
> > > > > +svq->vring.used = mmap(NULL, vhost_svq_device_area_size(svq),
> > > > > +   PROT_READ | PROT_WRITE, MAP_SHARED 
> > > > > |MAP_ANONYMOUS,
> > > > > +   -1, 0);
> > > > > +}
> > > >
> > > > I think it will be beneficial to avoid "if (packed)" conditionals on
> > > > the exposed functions that give information about the memory maps.
> > > > These need to be replicated at
> > > > hw/virtio/vhost-vdpa.c:vhost_vdpa_svq_map_rings.
> > > >
> > > > However, the current one depends on the driver area to live in the
> > > > same page as the descriptor area, so it is not suitable for this.
> > >
> > > I haven't really understood this.
> > >
> > > In split vqs the descriptor, driver and device areas are mapped to RW 
> > > pages.
> > > In vhost_vdpa.c:vhos

Re: [RFC v3 3/3] vhost: Allocate memory for packed vring

2024-08-11 Thread Eugenio Perez Martin
On Sun, Aug 11, 2024 at 7:20 PM Sahil  wrote:
>
> Hi,
>
> On Wednesday, August 7, 2024 9:52:10 PM GMT+5:30 Eugenio Perez Martin wrote:
> > On Fri, Aug 2, 2024 at 1:22 PM Sahil Siddiq  wrote:
> > > [...]
> > > @@ -726,17 +738,30 @@ void vhost_svq_start(VhostShadowVirtqueue *svq, 
> > > VirtIODevice *vdev,
> > >  svq->vring.num = virtio_queue_get_num(vdev,
> > >  virtio_get_queue_index(vq));
> > >  svq->num_free = svq->vring.num;
> > >
> > > -svq->vring.desc = mmap(NULL, vhost_svq_driver_area_size(svq),
> > > -   PROT_READ | PROT_WRITE, MAP_SHARED | 
> > > MAP_ANONYMOUS,
> > > -   -1, 0);
> > > -desc_size = sizeof(vring_desc_t) * svq->vring.num;
> > > -svq->vring.avail = (void *)((char *)svq->vring.desc + desc_size);
> > > -svq->vring.used = mmap(NULL, vhost_svq_device_area_size(svq),
> > > -   PROT_READ | PROT_WRITE, MAP_SHARED | 
> > > MAP_ANONYMOUS,
> > > -   -1, 0);
> > > -svq->desc_state = g_new0(SVQDescState, svq->vring.num);
> > > -svq->desc_next = g_new0(uint16_t, svq->vring.num);
> > > -for (unsigned i = 0; i < svq->vring.num - 1; i++) {
> > > +svq->is_packed = virtio_vdev_has_feature(svq->vdev, 
> > > VIRTIO_F_RING_PACKED);
> > > +
> > > +if (virtio_vdev_has_feature(svq->vdev, VIRTIO_F_RING_PACKED)) {
> > > +svq->vring_packed.vring.desc = mmap(NULL, 
> > > vhost_svq_memory_packed(svq),
> > > +  PROT_READ | PROT_WRITE, 
> > > MAP_SHARED | MAP_ANONYMOUS,
> > > +  -1, 0);
> > > +desc_size = sizeof(struct vring_packed_desc) * svq->vring.num;
> > > +svq->vring_packed.vring.driver = (void *)((char 
> > > *)svq->vring_packed.vring.desc + desc_size);
> > > +svq->vring_packed.vring.device = (void *)((char 
> > > *)svq->vring_packed.vring.driver +
> > > + sizeof(struct 
> > > vring_packed_desc_event));
> >
> > This is a great start but it will be problematic when you start
> > mapping the areas to the vdpa device. The driver area should be read
> > only for the device, but it is placed in the same page as a RW one.
> >
> > More on this later.
> >
> > > +} else {
> > > +svq->vring.desc = mmap(NULL, vhost_svq_driver_area_size(svq),
> > > +   PROT_READ | PROT_WRITE, MAP_SHARED 
> > > |MAP_ANONYMOUS,
> > > +   -1, 0);
> > > +desc_size = sizeof(vring_desc_t) * svq->vring.num;
> > > +svq->vring.avail = (void *)((char *)svq->vring.desc + desc_size);
> > > +svq->vring.used = mmap(NULL, vhost_svq_device_area_size(svq),
> > > +   PROT_READ | PROT_WRITE, MAP_SHARED 
> > > |MAP_ANONYMOUS,
> > > +   -1, 0);
> > > +}
> >
> > I think it will be beneficial to avoid "if (packed)" conditionals on
> > the exposed functions that give information about the memory maps.
> > These need to be replicated at
> > hw/virtio/vhost-vdpa.c:vhost_vdpa_svq_map_rings.
> >
> > However, the current one depends on the driver area to live in the
> > same page as the descriptor area, so it is not suitable for this.
>
> I haven't really understood this.
>
> In split vqs the descriptor, driver and device areas are mapped to RW pages.
> In vhost_vdpa.c:vhost_vdpa_svq_map_rings, the regions are mapped with
> the appropriate "perm" field that sets the R/W permissions in the DMAMap
> object. Is this problematic for the split vq format because the avail ring is
> anyway mapped to a RW page in "vhost_svq_start"?
>

OK, so maybe the word "map" is misleading here. The pages need to be
allocated for the QEMU process with both PROT_READ | PROT_WRITE, as
QEMU needs to write into them.

They are mapped to the device with vhost_vdpa_dma_map, and the last
bool parameter indicates whether the device needs write permission or
not. You can see how hw/virtio/vhost-vdpa.c:vhost_vdpa_svq_map_ring
checks the needle permission for this, and the needle permissions are
stored at hw/virtio/vhost-vdpa.c:vhost_vdpa_svq_map_rings. This is the
function that needs to check the map permissions.

> For packed vqs, the "Driver Event Suppression" data structure should be
> read-only for the device. Similar to split vqs, this is mapped to a RW page
> in "vhost_svq_start" but it is then mapped to a DMAMap object with read-
> only perms in "vhost_vdpa_svq_map_rings".
>
> I am a little confused about where the issue lies.
>
> Thanks,
> Sahil
>
>




[PATCH] hw/smbios: support for type 7 (cache information)

2024-08-11 Thread Hal Martin
This patch adds support for SMBIOS type 7 (Cache Information) to QEMU.

level: cache level (1-8)
size: cache size in bytes

Example usage:
-smbios type=7,level=1,size=0x8000

Signed-off-by: Hal Martin 
---
 hw/smbios/smbios.c   | 63 
 include/hw/firmware/smbios.h | 18 +++
 qemu-options.hx  |  2 ++
 3 files changed, 83 insertions(+)

diff --git a/hw/smbios/smbios.c b/hw/smbios/smbios.c
index a394514264..65942f2354 100644
--- a/hw/smbios/smbios.c
+++ b/hw/smbios/smbios.c
@@ -83,6 +83,12 @@ static struct {
 .processor_family = 0x01, /* Other */
 };
 
+struct type7_instance {
+uint16_t level, size;
+QTAILQ_ENTRY(type7_instance) next;
+};
+static QTAILQ_HEAD(, type7_instance) type7 = QTAILQ_HEAD_INITIALIZER(type7);
+
 struct type8_instance {
 const char *internal_reference, *external_reference;
 uint8_t connector_type, port_type;
@@ -330,6 +336,23 @@ static const QemuOptDesc qemu_smbios_type4_opts[] = {
 { /* end of list */ }
 };
 
+static const QemuOptDesc qemu_smbios_type7_opts[] = {
+{
+.name = "type",
+.type = QEMU_OPT_NUMBER,
+.help = "SMBIOS element type",
+},{
+.name = "level",
+.type = QEMU_OPT_NUMBER,
+.help = "cache level",
+},{
+.name = "size",
+.type = QEMU_OPT_NUMBER,
+.help = "cache size",
+},
+{ /* end of list */ }
+};
+
 static const QemuOptDesc qemu_smbios_type8_opts[] = {
 {
 .name = "type",
@@ -733,6 +756,32 @@ static void smbios_build_type_4_table(MachineState *ms, unsigned instance,
 smbios_type4_count++;
 }
 
+static void smbios_build_type_7_table(void)
+{
+unsigned instance = 0;
+struct type7_instance *t7;
+char designation[20];
+
+QTAILQ_FOREACH(t7, &type7, next) {
+SMBIOS_BUILD_TABLE_PRE(7, T0_BASE + instance, true);
+sprintf(designation, "CPU Internal L%d", t7->level);
+SMBIOS_TABLE_SET_STR(7, socket_designation, designation);
+t->cache_configuration =  0x180 | (t7->level-1); /* not socketed, enabled, write back */
+t->installed_size =  t7->size;
+t->maximum_cache_size =  t7->size; /* set max to installed */
+t->supported_sram_type = 0x10; /* pipeline burst */
+t->current_sram_type = 0x10; /* pipeline burst */
+t->cache_speed = 0x1; /* 1 ns */
+t->error_correction_type = 0x6; /* Multi-bit ECC */
+t->system_cache_type = 0x05; /* Unified */
+t->associativity = 0x6; /* Fully Associative */
+t->maximum_cache_size2 = t7->size;
+t->installed_cache_size2 = t7->size;
+SMBIOS_BUILD_TABLE_POST;
+instance++;
+}
+}
+
 static void smbios_build_type_8_table(void)
 {
 unsigned instance = 0;
@@ -1120,6 +1169,7 @@ static bool smbios_get_tables_ep(MachineState *ms,
 }
 }
 
+smbios_build_type_7_table();
 smbios_build_type_8_table();
 smbios_build_type_9_table(errp);
 smbios_build_type_11_table();
@@ -1478,6 +1528,19 @@ void smbios_entry_add(QemuOpts *opts, Error **errp)
UINT16_MAX);
 }
 return;
+case 7:
+if (!qemu_opts_validate(opts, qemu_smbios_type7_opts, errp)) {
+return;
+}
+struct type7_instance *t7_i;
+t7_i = g_new0(struct type7_instance, 1);
+t7_i->level = qemu_opt_get_number(opts,"level", 0x0);
+t7_i->size = qemu_opt_get_number(opts, "size", 0x0200);
+/* Only cache levels 1-8 are permitted */
+if (t7_i->level > 0 && t7_i->level < 9) {
+QTAILQ_INSERT_TAIL(&type7, t7_i, next);
+}
+return;
 case 8:
 if (!qemu_opts_validate(opts, qemu_smbios_type8_opts, errp)) {
 return;
diff --git a/include/hw/firmware/smbios.h b/include/hw/firmware/smbios.h
index f066ab7262..1ea1506b46 100644
--- a/include/hw/firmware/smbios.h
+++ b/include/hw/firmware/smbios.h
@@ -220,6 +220,24 @@ typedef enum smbios_type_4_len_ver {
 SMBIOS_TYPE_4_LEN_V30 = offsetofend(struct smbios_type_4, thread_count2),
 } smbios_type_4_len_ver;
 
+/* SMBIOS type 7 - Cache Information (v2.0+) */
+struct smbios_type_7 {
+struct smbios_structure_header header;
+uint8_t socket_designation;
+uint16_t cache_configuration;
+uint16_t maximum_cache_size;
+uint16_t installed_size;
+uint16_t supported_sram_type;
+uint16_t current_sram_type;
+uint8_t cache_speed;
+uint8_t error_correction_type;
+uint8_t system_cache_type;
+uint8_t associativity;
+uint32_t maximum_cache_size2;
+uint32_t installed_cache_size2;
+/* contained elements follow */
+} QEMU_PACKED;
+
 /* SMBIOS type 8 - Port Connector In

Re: [RFC v3 0/3] Add packed virtqueue to shadow virtqueue

2024-08-07 Thread Eugenio Perez Martin
On Fri, Aug 2, 2024 at 1:22 PM Sahil Siddiq  wrote:
>
> Hi,
>
> Here's a new patch series that incorporates all
> the suggested changes from v2.
>
> I have tried my best to deduplicate the implementation.
> Please let me know if I have missed something.
>

I think they are in good shape :).

> I'll also test these changes out by following the
> suggestions given in response to v1. I'll have more
> confidence once I know these changes work.
>

Please let me know if you need help with the testing!

> Thanks,
> Sahil
>
> v1: https://lists.nongnu.org/archive/html/qemu-devel/2024-06/msg03417.html
> v2: https://lists.nongnu.org/archive/html/qemu-devel/2024-07/msg06196.html
>
> Changes v2 -> v3:
> * vhost-shadow-virtqueue.c
>   - Move parts common to "vhost_svq_add_split" and
> "vhost_svq_add_packed" to "vhost_svq_add".
>   (vhost_svq_add_packed):
>   - Refactor to minimize duplicate code between
> this and "vhost_svq_add_split"
>   - Fix code style issues.
>   (vhost_svq_add_split):
>   - Merge with "vhost_svq_vring_write_descs()"
>   - Refactor to minimize duplicate code between
> this and "vhost_svq_add_packed"
>   (vhost_svq_add):
>   - Refactor to minimize duplicate code between
> split and packed version of "vhost_svq_add"
>   (vhost_svq_memory_packed): New function
>   (vhost_svq_start):
>   - Remove common variables out of if-else branch.
>   (vhost_svq_stop):
>   - Add support for packed vq.
>   (vhost_svq_get_vring_addr): Revert changes
>   (vhost_svq_get_vring_addr_packed): Likewise.
> * vhost-shadow-virtqueue.h
>   - Revert changes made to "vhost_svq_get_vring_addr*"
> functions.
> * vhost-vdpa.c: Revert changes.
>
> Sahil Siddiq (3):
>   vhost: Introduce packed vq and add buffer elements
>   vhost: Data structure changes to support packed vqs
>   vhost: Allocate memory for packed vring
>
>  hw/virtio/vhost-shadow-virtqueue.c | 230 -
>  hw/virtio/vhost-shadow-virtqueue.h |  70 ++---
>  2 files changed, 206 insertions(+), 94 deletions(-)
>
> --
> 2.45.2
>




Re: [RFC v3 1/3] vhost: Introduce packed vq and add buffer elements

2024-08-07 Thread Eugenio Perez Martin
On Fri, Aug 2, 2024 at 1:22 PM Sahil Siddiq  wrote:
>
> This is the first patch in a series to add support for packed
> virtqueues in vhost_shadow_virtqueue. This patch implements the
> insertion of available buffers in the descriptor area. It takes
> into account descriptor chains, but does not consider indirect
> descriptors.
>
> Signed-off-by: Sahil Siddiq 
> ---
> Changes v2 -> v3:
> * vhost-shadow-virtqueue.c
>   - Move parts common to "vhost_svq_add_split" and
> "vhost_svq_add_packed" to "vhost_svq_add".
>   (vhost_svq_add_packed):
>   - Refactor to minimize duplicate code between
> this and "vhost_svq_add_split"
>   - Fix code style issues.
>   (vhost_svq_add_split):
>   - Merge with "vhost_svq_vring_write_descs()"
>   - Refactor to minimize duplicate code between
> this and "vhost_svq_add_packed"
>   (vhost_svq_add):
>   - Refactor to minimize duplicate code between
> split and packed version of "vhost_svq_add"
>
>  hw/virtio/vhost-shadow-virtqueue.c | 174 +++--
>  1 file changed, 115 insertions(+), 59 deletions(-)
>
> diff --git a/hw/virtio/vhost-shadow-virtqueue.c b/hw/virtio/vhost-shadow-virtqueue.c
> index fc5f408f77..4c308ee53d 100644
> --- a/hw/virtio/vhost-shadow-virtqueue.c
> +++ b/hw/virtio/vhost-shadow-virtqueue.c
> @@ -124,97 +124,132 @@ static bool vhost_svq_translate_addr(const VhostShadowVirtqueue *svq,
>  }
>
>  /**
> - * Write descriptors to SVQ vring
> + * Write descriptors to SVQ split vring
>   *
>   * @svq: The shadow virtqueue
> - * @sg: Cache for hwaddr
> - * @iovec: The iovec from the guest
> - * @num: iovec length
> - * @more_descs: True if more descriptors come in the chain
> - * @write: True if they are writeable descriptors
> - *
> - * Return true if success, false otherwise and print error.
> + * @out_sg: The iovec to the guest
> + * @out_num: Outgoing iovec length
> + * @in_sg: The iovec from the guest
> + * @in_num: Incoming iovec length
> + * @sgs: Cache for hwaddr
> + * @head: Saves current free_head
>   */
> -static bool vhost_svq_vring_write_descs(VhostShadowVirtqueue *svq, hwaddr *sg,
> -const struct iovec *iovec, size_t num,
> -bool more_descs, bool write)
> +static void vhost_svq_add_split(VhostShadowVirtqueue *svq,
> +const struct iovec *out_sg, size_t out_num,
> +const struct iovec *in_sg, size_t in_num,
> +hwaddr *sgs, unsigned *head)
>  {
> +unsigned avail_idx, n;
>  uint16_t i = svq->free_head, last = svq->free_head;
> -unsigned n;
> -uint16_t flags = write ? cpu_to_le16(VRING_DESC_F_WRITE) : 0;
> +vring_avail_t *avail = svq->vring.avail;
>  vring_desc_t *descs = svq->vring.desc;
> -bool ok;
> -
> -if (num == 0) {
> -return true;
> -}
> +size_t num = in_num + out_num;
>
> -ok = vhost_svq_translate_addr(svq, sg, iovec, num);
> -if (unlikely(!ok)) {
> -return false;
> -}
> +*head = svq->free_head;
>
>  for (n = 0; n < num; n++) {
> -if (more_descs || (n + 1 < num)) {
> -descs[i].flags = flags | cpu_to_le16(VRING_DESC_F_NEXT);
> +descs[i].flags = cpu_to_le16(n < out_num ? 0 : VRING_DESC_F_WRITE);
> +if (n + 1 < num) {
> +descs[i].flags |= cpu_to_le16(VRING_DESC_F_NEXT);
>  descs[i].next = cpu_to_le16(svq->desc_next[i]);
> +}
> +
> +descs[i].addr = cpu_to_le64(sgs[n]);
> +if (n < out_num) {
> +descs[i].len = cpu_to_le32(out_sg[n].iov_len);
>  } else {
> -descs[i].flags = flags;
> +descs[i].len = cpu_to_le32(in_sg[n - out_num].iov_len);
>  }
> -descs[i].addr = cpu_to_le64(sg[n]);
> -descs[i].len = cpu_to_le32(iovec[n].iov_len);
>
>  last = i;
>  i = cpu_to_le16(svq->desc_next[i]);
>  }
>
>  svq->free_head = le16_to_cpu(svq->desc_next[last]);
> -return true;
> +
> +/*
> + * Put the entry in the available array (but don't update avail->idx until
> + * they do sync).
> + */
> +avail_idx = svq->shadow_avail_idx & (svq->vring.num - 1);
> +avail->ring[avail_idx] = cpu_to_le16(*head);
> +svq->shadow_avail_idx++;
> +
> +/* Update the avail index after write the descriptor */
> +smp_wmb();
> +avail->idx = cpu_to_le16(svq->shadow_avail_idx);
>  }
>

I think this code is already in a very good shape. But actual testing
is needed before acks.

As a suggestion, we can split it into:
1) Refactor in vhost_svq_translate_addr to support out_num+in_num. No
functional change.
2) Refactor vhost_svq_add_split to extract common code into
vhost_svq_add. No functional change.
3) Adding packed code.

How to split or merge the patches is not a well-defined thing, so I'm
happy with this patch if you think the refactor is not worth it.

> -static bool vhost_svq_add_split(Vh

Re: [RFC v3 3/3] vhost: Allocate memory for packed vring

2024-08-07 Thread Eugenio Perez Martin
On Fri, Aug 2, 2024 at 1:22 PM Sahil Siddiq  wrote:
>
> Allocate memory for the packed vq format and support
> packed vq in the SVQ "start" and "stop" operations.
>
> Signed-off-by: Sahil Siddiq 
> ---
> Changes v2 -> v3:
> * vhost-shadow-virtqueue.c
>   (vhost_svq_memory_packed): New function
>   (vhost_svq_start):
>   - Remove common variables out of if-else branch.
>   (vhost_svq_stop):
>   - Add support for packed vq.
>   (vhost_svq_get_vring_addr): Revert changes
>   (vhost_svq_get_vring_addr_packed): Likewise.
> * vhost-shadow-virtqueue.h
>   - Revert changes made to "vhost_svq_get_vring_addr*"
> functions.
> * vhost-vdpa.c: Revert changes.
>
>  hw/virtio/vhost-shadow-virtqueue.c | 56 +++---
>  hw/virtio/vhost-shadow-virtqueue.h |  4 +++
>  2 files changed, 47 insertions(+), 13 deletions(-)
>
> diff --git a/hw/virtio/vhost-shadow-virtqueue.c b/hw/virtio/vhost-shadow-virtqueue.c
> index 4c308ee53d..f4285db2b4 100644
> --- a/hw/virtio/vhost-shadow-virtqueue.c
> +++ b/hw/virtio/vhost-shadow-virtqueue.c
> @@ -645,6 +645,8 @@ void vhost_svq_set_svq_call_fd(VhostShadowVirtqueue *svq, int call_fd)
>
>  /**
>   * Get the shadow vq vring address.
> + * This is used irrespective of whether the
> + * split or packed vq format is used.
>   * @svq: Shadow virtqueue
>   * @addr: Destination to store address
>   */
> @@ -672,6 +674,16 @@ size_t vhost_svq_device_area_size(const VhostShadowVirtqueue *svq)
>  return ROUND_UP(used_size, qemu_real_host_page_size());
>  }
>
> +size_t vhost_svq_memory_packed(const VhostShadowVirtqueue *svq)
> +{
> +size_t desc_size = sizeof(struct vring_packed_desc) * svq->num_free;
> +size_t driver_event_suppression = sizeof(struct vring_packed_desc_event);
> +size_t device_event_suppression = sizeof(struct vring_packed_desc_event);
> +
> +return ROUND_UP(desc_size + driver_event_suppression + device_event_suppression,
> +qemu_real_host_page_size());
> +}
> +
>  /**
>   * Set a new file descriptor for the guest to kick the SVQ and notify for avail
>   *
> @@ -726,17 +738,30 @@ void vhost_svq_start(VhostShadowVirtqueue *svq, VirtIODevice *vdev,
>
>  svq->vring.num = virtio_queue_get_num(vdev, virtio_get_queue_index(vq));
>  svq->num_free = svq->vring.num;
> -svq->vring.desc = mmap(NULL, vhost_svq_driver_area_size(svq),
> -   PROT_READ | PROT_WRITE, MAP_SHARED | MAP_ANONYMOUS,
> -   -1, 0);
> -desc_size = sizeof(vring_desc_t) * svq->vring.num;
> -svq->vring.avail = (void *)((char *)svq->vring.desc + desc_size);
> -svq->vring.used = mmap(NULL, vhost_svq_device_area_size(svq),
> -   PROT_READ | PROT_WRITE, MAP_SHARED | MAP_ANONYMOUS,
> -   -1, 0);
> -svq->desc_state = g_new0(SVQDescState, svq->vring.num);
> -svq->desc_next = g_new0(uint16_t, svq->vring.num);
> -for (unsigned i = 0; i < svq->vring.num - 1; i++) {
> +svq->is_packed = virtio_vdev_has_feature(svq->vdev, VIRTIO_F_RING_PACKED);
> +
> +if (virtio_vdev_has_feature(svq->vdev, VIRTIO_F_RING_PACKED)) {
> +svq->vring_packed.vring.desc = mmap(NULL, vhost_svq_memory_packed(svq),
> +PROT_READ | PROT_WRITE, MAP_SHARED | MAP_ANONYMOUS,
> +-1, 0);
> +desc_size = sizeof(struct vring_packed_desc) * svq->vring.num;
> +svq->vring_packed.vring.driver = (void *)((char *)svq->vring_packed.vring.desc + desc_size);
> +svq->vring_packed.vring.device = (void *)((char *)svq->vring_packed.vring.driver +
> +  sizeof(struct vring_packed_desc_event));

This is a great start but it will be problematic when you start
mapping the areas to the vdpa device. The driver area should be read
only for the device, but it is placed in the same page as a RW one.

More on this later.

> +} else {
> +svq->vring.desc = mmap(NULL, vhost_svq_driver_area_size(svq),
> +   PROT_READ | PROT_WRITE, MAP_SHARED | MAP_ANONYMOUS,
> +   -1, 0);
> +desc_size = sizeof(vring_desc_t) * svq->vring.num;
> +svq->vring.avail = (void *)((char *)svq->vring.desc + desc_size);
> +svq->vring.used = mmap(NULL, vhost_svq_device_area_size(svq),
> +   PROT_READ | PROT_WRITE, MAP_SHARED | MAP_ANONYMOUS,
> +   -1, 0);
> +}

I think it will be beneficial to avoid "if (packed)" conditionals on
the exposed functions that give information about the memory maps.
These need to be replicated at
hw/virtio/vhost-vdpa.c:vhost_vdpa_svq_map_rings.

However, the current one depends on the driver area to live in the
same page as the descriptor area, so it is not suitable for this.

So what about this action plan:
1) Make the avail ring (or driver area) i

Re: [RFC 0/2] Identify aliased maps in vdpa SVQ iova_tree

2024-08-01 Thread Eugenio Perez Martin
On Thu, Aug 1, 2024 at 2:41 AM Si-Wei Liu  wrote:
>
> Hi Jonah,
>
> On 7/31/2024 7:09 AM, Jonah Palmer wrote:
> >
> >> Let me clarify, correct me if I was wrong:
> >>
> >> 1) IOVA allocator is still implemented via a tree, we just
> >> don't need
> >> to store how the IOVA is used
> >> 2) A dedicated GPA -> IOVA tree, updated via listeners and is
> >> used in
> >> the datapath SVQ translation
> >> 3) A linear mapping or another SVQ -> IOVA tree used for SVQ
> >>
> >
> > His solution is composed of three trees:
> > 1) One for the IOVA allocations, so we know where to allocate
> > new ranges
> > 2) One of the GPA -> SVQ IOVA translations.
> > 3) Another one for SVQ vrings translations.
> >
> >
> >>>
> >
> > For my understanding, say we have those 3 memory mappings:
> >
> > HVA                             GPA                 IOVA
> > ---
> > Map
> > (1) [0x7f7903e0, 0x7f7983e0)    [0x0, 0x8000)       [0x1000, 0x8000)
> > (2) [0x7f7983e0, 0x7f9903e0)    [0x1, 0x208000)     [0x80001000, 0x201000)
> > (3) [0x7f7903ea, 0x7f7903ec)    [0xfeda, 0xfedc)    [0x201000, 0x221000)
> >
> > And then say when we go to unmap (e.g. vhost_vdpa_svq_unmap_ring)
> > we're given an HVA of 0x7f7903eb, which fits in both the first and
> > third mappings.
> >
> > The correct one to remove here would be the third mapping, right? Not
> > only because the HVA range of the third mapping has a more "specific"
> > or "tighter" range fit given an HVA of 0x7f7903eb (which, as I
> > understand, may not always be the case in other scenarios), but mainly
> > because the HVA->GPA translation would give GPA 0xfedb, which only
> > fits in the third mapping's GPA range. Am I understanding this correctly?
> You're correct, we would still need a GPA -> IOVA tree for mapping and
> unmapping on guest mem. I've talked to Eugenio this morning and I think
> he is now aligned. Granted, this GPA tree is partial in IOVA space that
> doesn't contain ranges from host-only memory (e.g. backed by SVQ
> descriptors or buffers), we could create an API variant to
> vhost_iova_tree_map_alloc() and vhost_iova_tree_map_remove(), which not
> just adds IOVA -> HVA range to the HVA tree, but also manipulates the
> GPA tree to maintain guest memory mappings, i.e. only invoked from the
> memory listener ops. Such that this new API is distinguishable from the
> one in the SVQ mapping and unmapping path that only manipulates the HVA
> tree.
>

Right, I think I understand both Jason's and your approach better, and
I think it is the best one. To modify the lookup API is hard, as the
caller does not know if the HVA looked up is contained in the guest
memory or not. To modify the add or remove regions is easier, as they
know it.

> I think the only case that you may need to pay attention to in
> implementation is in the SVQ address translation path, where if you come
> to an HVA address for translation, you would need to tell apart which
> tree you'd have to look up - if this HVA is backed by guest mem you
> could use API qemu_ram_block_from_host() to infer the ram block then the
> GPA, so you end up doing a lookup on the GPA tree; or else the HVA may
> be from the SVQ mappings, where you'd have to search the HVA tree again
> to look for host-mem-only range before you can claim the HVA is a
> bogus/unmapped address...

I'd leave this HVA -> IOVA tree for future performance optimization on
top, and focus on the aliased maps for a first series.

However, calling qemu_ram_block_from_host is actually not needed if
the HVA tree contains all the translations, both SVQ and guest buffers
in memory.

> For now, this additional second lookup is
> sub-optimal but inevitable, but I think both of us agreed that you
> could start to implement this version first, and look for future
> opportunity to optimize the lookup performance on top.
>

Right, thanks for explaining!

> >
> > ---
> >
> > In the case where the first mapping here is removed (GPA [0x0,
> > 0x8000)), why do we use the word "reintroduce" here? As I
> > understand it, when we remove a mapping, we're essentially
> > invalidating the IOVA range associated with that mapping, right? In
> > other words, the IOVA ranges here don't overlap, so removing a mapping
> > where its HVA range overlaps another mapping's HVA range shouldn't
> > affect the other mapping since they have unique IOVA ranges. Is my
> > understanding correct here or am I probably missing something?
> With the GPA tree I think this case should work fine. I've double
> checked the implementation of vhost-vdpa iotlb, and doesn't see a red
> flag there.
>
> Thanks,
> -Siwei
>




Re: [RFC 0/2] Identify aliased maps in vdpa SVQ iova_tree

2024-07-31 Thread Eugenio Perez Martin
On Tue, Jul 30, 2024 at 2:32 PM Jonah Palmer  wrote:
>
>
>
> On 7/30/24 7:00 AM, Eugenio Perez Martin wrote:
> > On Tue, Jul 30, 2024 at 10:48 AM Jason Wang  wrote:
> >>
> >> On Mon, Jul 29, 2024 at 6:05 PM Eugenio Perez Martin
> >>  wrote:
> >>>
> >>> On Wed, Jul 24, 2024 at 7:00 PM Jonah Palmer  
> >>> wrote:
> >>>>
> >>>>
> >>>>
> >>>> On 5/13/24 11:56 PM, Jason Wang wrote:
> >>>>> On Mon, May 13, 2024 at 5:58 PM Eugenio Perez Martin
> >>>>>  wrote:
> >>>>>>
> >>>>>> On Mon, May 13, 2024 at 10:28 AM Jason Wang  
> >>>>>> wrote:
> >>>>>>>
> >>>>>>> On Mon, May 13, 2024 at 2:28 PM Eugenio Perez Martin
> >>>>>>>  wrote:
> >>>>>>>>
> >>>>>>>> On Sat, May 11, 2024 at 6:07 AM Jason Wang  
> >>>>>>>> wrote:
> >>>>>>>>>
> >>>>>>>>> On Fri, May 10, 2024 at 3:16 PM Eugenio Perez Martin
> >>>>>>>>>  wrote:
> >>>>>>>>>>
> >>>>>>>>>> On Fri, May 10, 2024 at 6:29 AM Jason Wang  
> >>>>>>>>>> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>> On Thu, May 9, 2024 at 3:10 PM Eugenio Perez Martin 
> >>>>>>>>>>>  wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Thu, May 9, 2024 at 8:27 AM Jason Wang  
> >>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On Thu, May 9, 2024 at 1:16 AM Eugenio Perez Martin 
> >>>>>>>>>>>>>  wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> On Wed, May 8, 2024 at 4:29 AM Jason Wang 
> >>>>>>>>>>>>>>  wrote:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> On Tue, May 7, 2024 at 6:57 PM Eugenio Perez Martin 
> >>>>>>>>>>>>>>>  wrote:
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> On Tue, May 7, 2024 at 9:29 AM Jason Wang 
> >>>>>>>>>>>>>>>>  wrote:
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> On Fri, Apr 12, 2024 at 3:56 PM Eugenio Perez Martin
> >>>>>>>>>>>>>>>>>  wrote:
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> On Fri, Apr 12, 2024 at 8:47 AM Jason Wang 
> >>>>>>>>>>>>>>>>>>  wrote:
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> On Wed, Apr 10, 2024 at 6:03 PM Eugenio Pérez 
> >>>>>>>>>>>>>>>>>>>  wrote:
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> The guest may have overlapped memory regions, where 
> >>>>>>>>>>>>>>>>>>>> different GPA leads
> >>>>>>>>>>>>>>>>>>>> to the same HVA.  This causes a problem when overlapped 
> >>>>>>>>>>>>>>>>>>>> regions
> >>>>>>>>>>>>>>>>>>>> (different GPA but same translated HVA) exists in the 
> >>>>>>>>>>>>>>>>>>>> tree, as looking
> >>>>>>>>>>>>>>>>>>>> them by HVA will return them twice.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> I think I don't understand if there's any side effect for 

Re: [RFC 0/2] Identify aliased maps in vdpa SVQ iova_tree

2024-07-30 Thread Eugenio Perez Martin
On Tue, Jul 30, 2024 at 10:48 AM Jason Wang  wrote:
>
> On Mon, Jul 29, 2024 at 6:05 PM Eugenio Perez Martin
>  wrote:
> >
> > On Wed, Jul 24, 2024 at 7:00 PM Jonah Palmer  
> > wrote:
> > >
> > >
> > >
> > > On 5/13/24 11:56 PM, Jason Wang wrote:
> > > > On Mon, May 13, 2024 at 5:58 PM Eugenio Perez Martin
> > > >  wrote:
> > > >>
> > > >> On Mon, May 13, 2024 at 10:28 AM Jason Wang  
> > > >> wrote:
> > > >>>
> > > >>> On Mon, May 13, 2024 at 2:28 PM Eugenio Perez Martin
> > > >>>  wrote:
> > > >>>>
> > > >>>> On Sat, May 11, 2024 at 6:07 AM Jason Wang  
> > > >>>> wrote:
> > > >>>>>
> > > >>>>> On Fri, May 10, 2024 at 3:16 PM Eugenio Perez Martin
> > > >>>>>  wrote:
> > > >>>>>>
> > > >>>>>> On Fri, May 10, 2024 at 6:29 AM Jason Wang  
> > > >>>>>> wrote:
> > > >>>>>>>
> > > >>>>>>> On Thu, May 9, 2024 at 3:10 PM Eugenio Perez Martin 
> > > >>>>>>>  wrote:
> > > >>>>>>>>
> > > >>>>>>>> On Thu, May 9, 2024 at 8:27 AM Jason Wang  
> > > >>>>>>>> wrote:
> > > >>>>>>>>>
> > > >>>>>>>>> On Thu, May 9, 2024 at 1:16 AM Eugenio Perez Martin 
> > > >>>>>>>>>  wrote:
> > > >>>>>>>>>>
> > > >>>>>>>>>> On Wed, May 8, 2024 at 4:29 AM Jason Wang 
> > > >>>>>>>>>>  wrote:
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> On Tue, May 7, 2024 at 6:57 PM Eugenio Perez Martin 
> > > >>>>>>>>>>>  wrote:
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> On Tue, May 7, 2024 at 9:29 AM Jason Wang 
> > > >>>>>>>>>>>>  wrote:
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> On Fri, Apr 12, 2024 at 3:56 PM Eugenio Perez Martin
> > > >>>>>>>>>>>>>  wrote:
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> On Fri, Apr 12, 2024 at 8:47 AM Jason Wang 
> > > >>>>>>>>>>>>>>  wrote:
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>> On Wed, Apr 10, 2024 at 6:03 PM Eugenio Pérez 
> > > >>>>>>>>>>>>>>>  wrote:
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>> The guest may have overlapped memory regions, where 
> > > >>>>>>>>>>>>>>>> different GPA leads
> > > >>>>>>>>>>>>>>>> to the same HVA.  This causes a problem when overlapped 
> > > >>>>>>>>>>>>>>>> regions
> > > >>>>>>>>>>>>>>>> (different GPA but same translated HVA) exists in the 
> > > >>>>>>>>>>>>>>>> tree, as looking
> > > >>>>>>>>>>>>>>>> them by HVA will return them twice.
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>> I think I don't understand if there's any side effect for 
> > > >>>>>>>>>>>>>>> shadow virtqueue?
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> My bad, I totally forgot to put a reference to where this 
> > > >>>>>>>>>>>>>> comes from.
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>

Re: [RFC 0/2] Identify aliased maps in vdpa SVQ iova_tree

2024-07-29 Thread Eugenio Perez Martin
On Mon, Jul 29, 2024 at 7:50 PM Jonah Palmer  wrote:
>
>
>
> On 7/29/24 6:04 AM, Eugenio Perez Martin wrote:
> > On Wed, Jul 24, 2024 at 7:00 PM Jonah Palmer  
> > wrote:
> >>
> >>
> >>
> >> On 5/13/24 11:56 PM, Jason Wang wrote:
> >>> On Mon, May 13, 2024 at 5:58 PM Eugenio Perez Martin
> >>>  wrote:
> >>>>
> >>>> On Mon, May 13, 2024 at 10:28 AM Jason Wang  wrote:
> >>>>>
> >>>>> On Mon, May 13, 2024 at 2:28 PM Eugenio Perez Martin
> >>>>>  wrote:
> >>>>>>
> >>>>>> On Sat, May 11, 2024 at 6:07 AM Jason Wang  wrote:
> >>>>>>>
> >>>>>>> On Fri, May 10, 2024 at 3:16 PM Eugenio Perez Martin
> >>>>>>>  wrote:
> >>>>>>>>
> >>>>>>>> On Fri, May 10, 2024 at 6:29 AM Jason Wang  
> >>>>>>>> wrote:
> >>>>>>>>>
> >>>>>>>>> On Thu, May 9, 2024 at 3:10 PM Eugenio Perez Martin 
> >>>>>>>>>  wrote:
> >>>>>>>>>>
> >>>>>>>>>> On Thu, May 9, 2024 at 8:27 AM Jason Wang  
> >>>>>>>>>> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>> On Thu, May 9, 2024 at 1:16 AM Eugenio Perez Martin 
> >>>>>>>>>>>  wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Wed, May 8, 2024 at 4:29 AM Jason Wang  
> >>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On Tue, May 7, 2024 at 6:57 PM Eugenio Perez Martin 
> >>>>>>>>>>>>>  wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> On Tue, May 7, 2024 at 9:29 AM Jason Wang 
> >>>>>>>>>>>>>>  wrote:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> On Fri, Apr 12, 2024 at 3:56 PM Eugenio Perez Martin
> >>>>>>>>>>>>>>>  wrote:
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> On Fri, Apr 12, 2024 at 8:47 AM Jason Wang 
> >>>>>>>>>>>>>>>>  wrote:
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> On Wed, Apr 10, 2024 at 6:03 PM Eugenio Pérez 
> >>>>>>>>>>>>>>>>>  wrote:
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> The guest may have overlapped memory regions, where 
> >>>>>>>>>>>>>>>>>> different GPA leads
> >>>>>>>>>>>>>>>>>> to the same HVA.  This causes a problem when overlapped 
> >>>>>>>>>>>>>>>>>> regions
> >>>>>>>>>>>>>>>>>> (different GPA but same translated HVA) exists in the 
> >>>>>>>>>>>>>>>>>> tree, as looking
> >>>>>>>>>>>>>>>>>> them by HVA will return them twice.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> I think I don't understand if there's any side effect for 
> >>>>>>>>>>>>>>>>> shadow virtqueue?
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> My bad, I totally forgot to put a reference to where this 
> >>>>>>>>>>>>>>>> comes from.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Si-Wei found that during initialization this sequences of 
> >>>>>>>>>>>>>>>> maps /
> >>>>>>>>>>>>>

Re: [RFC 0/2] Identify aliased maps in vdpa SVQ iova_tree

2024-07-29 Thread Eugenio Perez Martin
On Wed, Jul 24, 2024 at 7:00 PM Jonah Palmer  wrote:
>
>
>
> On 5/13/24 11:56 PM, Jason Wang wrote:
> > On Mon, May 13, 2024 at 5:58 PM Eugenio Perez Martin
> >  wrote:
> >>
> >> On Mon, May 13, 2024 at 10:28 AM Jason Wang  wrote:
> >>>
> >>> On Mon, May 13, 2024 at 2:28 PM Eugenio Perez Martin
> >>>  wrote:
> >>>>
> >>>> On Sat, May 11, 2024 at 6:07 AM Jason Wang  wrote:
> >>>>>
> >>>>> On Fri, May 10, 2024 at 3:16 PM Eugenio Perez Martin
> >>>>>  wrote:
> >>>>>>
> >>>>>> On Fri, May 10, 2024 at 6:29 AM Jason Wang  wrote:
> >>>>>>>
> >>>>>>> On Thu, May 9, 2024 at 3:10 PM Eugenio Perez Martin 
> >>>>>>>  wrote:
> >>>>>>>>
> >>>>>>>> On Thu, May 9, 2024 at 8:27 AM Jason Wang  
> >>>>>>>> wrote:
> >>>>>>>>>
> >>>>>>>>> On Thu, May 9, 2024 at 1:16 AM Eugenio Perez Martin 
> >>>>>>>>>  wrote:
> >>>>>>>>>>
> >>>>>>>>>> On Wed, May 8, 2024 at 4:29 AM Jason Wang  
> >>>>>>>>>> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>> On Tue, May 7, 2024 at 6:57 PM Eugenio Perez Martin 
> >>>>>>>>>>>  wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Tue, May 7, 2024 at 9:29 AM Jason Wang  
> >>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On Fri, Apr 12, 2024 at 3:56 PM Eugenio Perez Martin
> >>>>>>>>>>>>>  wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> On Fri, Apr 12, 2024 at 8:47 AM Jason Wang 
> >>>>>>>>>>>>>>  wrote:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> On Wed, Apr 10, 2024 at 6:03 PM Eugenio Pérez 
> >>>>>>>>>>>>>>>  wrote:
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> The guest may have overlapped memory regions, where 
> >>>>>>>>>>>>>>>> different GPA leads
> >>>>>>>>>>>>>>>> to the same HVA.  This causes a problem when overlapped 
> >>>>>>>>>>>>>>>> regions
> >>>>>>>>>>>>>>>> (different GPA but same translated HVA) exists in the tree, 
> >>>>>>>>>>>>>>>> as looking
> >>>>>>>>>>>>>>>> them by HVA will return them twice.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> I think I don't understand if there's any side effect for 
> >>>>>>>>>>>>>>> shadow virtqueue?
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> My bad, I totally forgot to put a reference to where this 
> >>>>>>>>>>>>>> comes from.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Si-Wei found that during initialization this sequences of maps 
> >>>>>>>>>>>>>> /
> >>>>>>>>>>>>>> unmaps happens [1]:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> HVAGPAIOVA
> >>>>>>>>>>>>>> -
> >>>>>>>>>>>>>> Map
> >>>>>>>>>>>>>> [0x7f7903e0, 0x7f7983e0)[0x0, 0x8000) [0x1000, 
> >>>>>>>>>>>>>> 0x8000)

Re: [RFC v2 1/3] vhost: Introduce packed vq and add buffer elements

2024-07-29 Thread Eugenio Perez Martin
On Sun, Jul 28, 2024 at 7:37 PM Sahil  wrote:
>
> Hi,
>
> On Friday, July 26, 2024 7:18:28 PM GMT+5:30 Eugenio Perez Martin wrote:
> > On Fri, Jul 26, 2024 at 11:58 AM Sahil Siddiq  wrote:
> > > This is the first patch in a series to add support for packed
> > > virtqueues in vhost_shadow_virtqueue. This patch implements the
> > > insertion of available buffers in the descriptor area. It takes
> > > into account descriptor chains, but does not consider indirect
> > > descriptors.
> > >
> > > Signed-off-by: Sahil Siddiq 
> > > ---
> > > Changes v1 -> v2:
> > > * Split commit from RFC v1 into two commits.
> > > * vhost-shadow-virtqueue.c
> > >
> > >   (vhost_svq_add_packed):
> > >   - Merge with "vhost_svq_vring_write_descs_packed()"
> > >   - Remove "num == 0" check
> > >
> > >  hw/virtio/vhost-shadow-virtqueue.c | 93 +-
> > >  1 file changed, 92 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/hw/virtio/vhost-shadow-virtqueue.c
> > > b/hw/virtio/vhost-shadow-virtqueue.c index fc5f408f77..c7b7e0c477 100644
> > > --- a/hw/virtio/vhost-shadow-virtqueue.c
> > > +++ b/hw/virtio/vhost-shadow-virtqueue.c
> > > @@ -217,6 +217,91 @@ static bool vhost_svq_add_split(VhostShadowVirtqueue 
> > > *svq,
> > >  return true;
> > >
> > >  }
> > >
> > > +static bool vhost_svq_add_packed(VhostShadowVirtqueue *svq,
> > > +const struct iovec *out_sg, size_t 
> > > out_num,
> > > +const struct iovec *in_sg, size_t in_num,
> > > +unsigned *head)
> > > +{
> > > +bool ok;
> > > +uint16_t head_flags = 0;
> > > +g_autofree hwaddr *sgs = g_new(hwaddr, out_num + in_num);
> > > +
> > > +*head = svq->vring_packed.next_avail_idx;
> > > +
> > > +/* We need some descriptors here */
> > > +if (unlikely(!out_num && !in_num)) {
> > > +qemu_log_mask(LOG_GUEST_ERROR,
> > > +  "Guest provided element with no descriptors");
> > > +return false;
> > > +}
> > > +
> > > +uint16_t id, curr, i;
> > > +unsigned n;
> > > +struct vring_packed_desc *descs = svq->vring_packed.vring.desc;
> > > +
> > > +i = *head;
> > > +id = svq->free_head;
> > > +curr = id;
> > > +
> > > +size_t num = out_num + in_num;
> > > +
> > > +ok = vhost_svq_translate_addr(svq, sgs, out_sg, out_num);
> > > +if (unlikely(!ok)) {
> > > +return false;
> > > +}
> > > +
> > > +ok = vhost_svq_translate_addr(svq, sgs + out_num, in_sg, in_num);
> > > +if (unlikely(!ok)) {
> > > +return false;
> > > +}
> > > +
> >
> > (sorry I missed this from the RFC v1) I think all of the above should
> > be in the caller, shouldn't it? It is duplicated with split.
>
> I don't think this will be straightforward. While they perform the same
> logical step in both cases, their implementation is a little different.
> For example, the "sgs" pointer is created a little differently in both
> cases.

Do you mean because MAX() vs in_num+out_num? It is ok to convert both
to the latter.

> The parameters to "vhost_svq_translate_addr" are also a little
> different. I think if they are moved to the caller, they will be in both
> "svq->is_packed" branches (in "vhost_svq_add").
>

I don't see any difference apart from calling it with in and out sgs
separately or calling it for all of the array, am I missing something?

> > Also, declarations should be at the beginning of blocks per QEMU
> > coding style [1].
>
> Sorry, I missed this. I'll rectify this.
>

No worries!

You can run scripts/checkpatch.pl in QEMU for the next series, it
should catch many of these small issues.

Thanks!




Re: [RFC v2 0/3] Add packed virtqueue to shadow virtqueue

2024-07-26 Thread Eugenio Perez Martin
On Fri, Jul 26, 2024 at 7:11 PM Sahil  wrote:
>
> Hi,
>
> On Friday, July 26, 2024 7:10:24 PM GMT+5:30 Eugenio Perez Martin wrote:
> > On Fri, Jul 26, 2024 at 11:58 AM Sahil Siddiq  wrote:
> > > [...]
> > > Q1.
> > > In virtio_ring.h [2], new aliases with memory alignment enforcement
> > > such as "vring_desc_t" have been created. I am not sure if this
> > > is required for the packed vq descriptor ring (vring_packed_desc)
> > > as well. I don't see a type alias that enforces memory alignment
> > > for "vring_packed_desc" in the linux kernel. I haven't used any
> > > alias either.
> >
> > The alignment is required to be 16 for the descriptor ring and 4 for
> > the device and driver areas by the standard [1]. In QEMU, this is
> > solved by calling mmap, which always returns page-aligned addresses.
>
> Ok, I understand this now.
>
> > > Q2.
> > > I see that parts of the "vhost-vdpa" implementation are based on
> > > the assumption that SVQ uses the split vq format. For example,
> > > "vhost_vdpa_svq_map_rings" [3], calls "vhost_svq_device_area_size"
> > > which is specific to split vqs. The "vhost_vring_addr" [4] struct
> > > is also specific to split vqs.
> > >
> > > My idea is to have a generic "vhost_vring_addr" structure that
> > > wraps around split and packed vq specific structures, rather
> > > than using them directly in if-else conditions wherever the
> > > vhost-vdpa functions require their usage. However, this will
> > > involve checking their impact in several other places where this
> > > struct is currently being used (eg.: "vhost-user", "vhost-backend",
> > > "libvhost-user").
> >
> > Ok I've just found this is under-documented actually :).
> >
> > As you mention, vhost-user is already using this same struct for
> > packed vqs [2], just translating the driver area from the avail vring
> > and the device area from the used vring. So the best option is to
> > stick with that, unless I'm missing something.
> >
> >
> > [1] https://docs.oasis-open.org/virtio/virtio/v1.3/virtio-v1.3.html
> > [2]
> > https://github.com/DPDK/dpdk/blob/82c47f005b9a0a1e3a649664b7713443d18abe43/
> > lib/vhost/vhost_user.c#L841C1-L841C25
>
> Sorry, I am a little confused here. I was referring to QEMU's vhost-user
> implementation here.
>
> Based on what I have understood, "vhost_vring_addr" is only being used
> for split vqs in QEMU's vhost-user and in other places too. The implementation
> does not take into account packed vqs.
>
> I was going through DPDK's source. In DPDK's implementation of vhost-user [1],
> the same struct (vhost_virtqueue) is being used for split vqs and packed vqs.
> This is possible since "vhost_virtqueue" [2] uses a union to wrap around the
> split and packed versions of the vq.
>

Ok, now I get you better. Let me start again from a different angle :).

vhost_vring_addr is already part of the API that QEMU uses between
itself and vhost devices: vhost-kernel, vhost-user and vhost-vdpa alike.
Making non-backward-compatible changes to it is impossible, as that
would require changes in all of these elements.

QEMU and DPDK, using vhost-user, already send and receive packed
virtqueue addresses using the current structure layout. QEMU's
hw/virtio/vhost.c:vhost_virtqueue_set_addr already sets vq->desc,
vq->avail and vq->used, which hold the values of the desc ring, driver
area and device area. In that sense, I recommend not modifying it.

On the other hand, DPDK's vhost_virtqueue is not the same struct as
vhost_vring_addr. It is internal to DPDK so it can be modified. We
need to do something similar for the SVQ, yes.

Doing that union trick piece by piece in VhostShadowVirtqueue is
possible, but it requires modifying all the usages of the current
vring. I think it is easier for us to follow the kernel's
virtio_ring.c model, as it is a driver too, and create a vring_packed
member. We can put it in an anonymous union and suffix all the packed
members with _packed so the current split usage does not need to change.

Let me know what you think.


Re: [RFC v2 3/3] vhost: Allocate memory for packed vring.

2024-07-26 Thread Eugenio Perez Martin
On Fri, Jul 26, 2024 at 11:59 AM Sahil Siddiq  wrote:
>
> Allocate memory for the packed vq format and support
> packed vq in the SVQ "start" operation.
>
> Signed-off-by: Sahil Siddiq 
> ---
> Changes v1 -> v2:
> * vhost-shadow-virtqueue.h
>   (struct VhostShadowVirtqueue): New member "is_packed"
>   (vhost_svq_get_vring_addr): Renamed function.
>   (vhost_svq_get_vring_addr_packed): New function.
>   (vhost_svq_memory_packed): Likewise.
> * vhost-shadow-virtqueue.c:
>   (vhost_svq_add): Use "is_packed" to check vq format.
>   (vhost_svq_get_vring_addr): Rename function.
>   (vhost_svq_get_vring_addr_packed): New function but is yet to be 
> implemented.
>   (vhost_svq_memory_packed): New function.
>   (vhost_svq_start): Support packed vq format.
> * vhost-vdpa.c
>   (vhost_svq_get_vring_addr): Rename function.
>
>
>  hw/virtio/vhost-shadow-virtqueue.c | 70 ++
>  hw/virtio/vhost-shadow-virtqueue.h | 10 -
>  hw/virtio/vhost-vdpa.c |  4 +-
>  3 files changed, 63 insertions(+), 21 deletions(-)
>
> diff --git a/hw/virtio/vhost-shadow-virtqueue.c 
> b/hw/virtio/vhost-shadow-virtqueue.c
> index c7b7e0c477..045c07304c 100644
> --- a/hw/virtio/vhost-shadow-virtqueue.c
> +++ b/hw/virtio/vhost-shadow-virtqueue.c
> @@ -343,7 +343,7 @@ int vhost_svq_add(VhostShadowVirtqueue *svq, const struct 
> iovec *out_sg,
>  return -ENOSPC;
>  }
>
> -if (virtio_vdev_has_feature(svq->vdev, VIRTIO_F_RING_PACKED)) {
> +if (svq->is_packed) {
>  ok = vhost_svq_add_packed(svq, out_sg, out_num,
>in_sg, in_num, &qemu_head);
>  } else {
> @@ -679,18 +679,29 @@ void vhost_svq_set_svq_call_fd(VhostShadowVirtqueue 
> *svq, int call_fd)
>  }
>
>  /**
> - * Get the shadow vq vring address.
> + * Get the split shadow vq vring address.
>   * @svq: Shadow virtqueue
>   * @addr: Destination to store address
>   */
> -void vhost_svq_get_vring_addr(const VhostShadowVirtqueue *svq,
> -  struct vhost_vring_addr *addr)
> +void vhost_svq_get_vring_addr_split(const VhostShadowVirtqueue *svq,
> +struct vhost_vring_addr *addr)
>  {
>  addr->desc_user_addr = (uint64_t)(uintptr_t)svq->vring.desc;
>  addr->avail_user_addr = (uint64_t)(uintptr_t)svq->vring.avail;
>  addr->used_user_addr = (uint64_t)(uintptr_t)svq->vring.used;
>  }
>
> +/**
> + * Get the packed shadow vq vring address.
> + * @svq: Shadow virtqueue
> + * @addr: Destination to store address
> + */
> +void vhost_svq_get_vring_addr_packed(const VhostShadowVirtqueue *svq,
> + struct vhost_vring_addr *addr)
> +{
> +/* TODO */
> +}
> +
>  size_t vhost_svq_driver_area_size(const VhostShadowVirtqueue *svq)
>  {
>  size_t desc_size = sizeof(vring_desc_t) * svq->vring.num;
> @@ -707,6 +718,16 @@ size_t vhost_svq_device_area_size(const 
> VhostShadowVirtqueue *svq)
>  return ROUND_UP(used_size, qemu_real_host_page_size());
>  }
>
> +size_t vhost_svq_memory_packed(const VhostShadowVirtqueue *svq)
> +{
> +size_t desc_size = sizeof(struct vring_packed_desc) * svq->num_free;
> +size_t driver_event_suppression = sizeof(struct vring_packed_desc_event);
> +size_t device_event_suppression = sizeof(struct vring_packed_desc_event);
> +
> +return ROUND_UP(desc_size + driver_event_suppression + 
> device_event_suppression,
> +qemu_real_host_page_size());
> +}
> +
>  /**
>   * Set a new file descriptor for the guest to kick the SVQ and notify for 
> avail
>   *
> @@ -759,19 +780,34 @@ void vhost_svq_start(VhostShadowVirtqueue *svq, 
> VirtIODevice *vdev,
>  svq->vq = vq;
>  svq->iova_tree = iova_tree;
>
> -svq->vring.num = virtio_queue_get_num(vdev, virtio_get_queue_index(vq));
> -svq->num_free = svq->vring.num;
> -svq->vring.desc = mmap(NULL, vhost_svq_driver_area_size(svq),
> -   PROT_READ | PROT_WRITE, MAP_SHARED | 
> MAP_ANONYMOUS,
> -   -1, 0);
> -desc_size = sizeof(vring_desc_t) * svq->vring.num;
> -svq->vring.avail = (void *)((char *)svq->vring.desc + desc_size);
> -svq->vring.used = mmap(NULL, vhost_svq_device_area_size(svq),
> -   PROT_READ | PROT_WRITE, MAP_SHARED | 
> MAP_ANONYMOUS,
> -   -1, 0);
> -svq->desc_state = g_new0(SVQDescState, svq->vring.num);
> -svq->desc_next = g_new0(uint16_t, svq->vring.num);
> -for (unsigned i = 0; i < svq->vring.num - 1; i++) {
> +if (virtio_vdev_has_feature(svq->vdev, VIRTIO_F_RING_PACKED)) {
> +svq->is_packed = true;
> +svq->vring_packed.vring.num = virtio_queue_get_num(vdev, 
> virtio_get_queue_index(vq));
> +svq->num_free = svq->vring_packed.vring.num;
> +svq->vring_packed.vring.desc = mmap(NULL, 
> vhost_svq_memory_packed(svq),
> +PROT_READ | PROT_WRITE, 
> MAP_SHARED | MAP_ANONYMOUS,
>

Re: [RFC v2 1/3] vhost: Introduce packed vq and add buffer elements

2024-07-26 Thread Eugenio Perez Martin
On Fri, Jul 26, 2024 at 11:58 AM Sahil Siddiq  wrote:
>
> This is the first patch in a series to add support for packed
> virtqueues in vhost_shadow_virtqueue. This patch implements the
> insertion of available buffers in the descriptor area. It takes
> into account descriptor chains, but does not consider indirect
> descriptors.
>
> Signed-off-by: Sahil Siddiq 
> ---
> Changes v1 -> v2:
> * Split commit from RFC v1 into two commits.
> * vhost-shadow-virtqueue.c
>   (vhost_svq_add_packed):
>   - Merge with "vhost_svq_vring_write_descs_packed()"
>   - Remove "num == 0" check
>
>  hw/virtio/vhost-shadow-virtqueue.c | 93 +-
>  1 file changed, 92 insertions(+), 1 deletion(-)
>
> diff --git a/hw/virtio/vhost-shadow-virtqueue.c 
> b/hw/virtio/vhost-shadow-virtqueue.c
> index fc5f408f77..c7b7e0c477 100644
> --- a/hw/virtio/vhost-shadow-virtqueue.c
> +++ b/hw/virtio/vhost-shadow-virtqueue.c
> @@ -217,6 +217,91 @@ static bool vhost_svq_add_split(VhostShadowVirtqueue 
> *svq,
>  return true;
>  }
>
> +static bool vhost_svq_add_packed(VhostShadowVirtqueue *svq,
> +const struct iovec *out_sg, size_t out_num,
> +const struct iovec *in_sg, size_t in_num,
> +unsigned *head)
> +{
> +bool ok;
> +uint16_t head_flags = 0;
> +g_autofree hwaddr *sgs = g_new(hwaddr, out_num + in_num);
> +
> +*head = svq->vring_packed.next_avail_idx;
> +
> +/* We need some descriptors here */
> +if (unlikely(!out_num && !in_num)) {
> +qemu_log_mask(LOG_GUEST_ERROR,
> +  "Guest provided element with no descriptors");
> +return false;
> +}
> +
> +uint16_t id, curr, i;
> +unsigned n;
> +struct vring_packed_desc *descs = svq->vring_packed.vring.desc;
> +
> +i = *head;
> +id = svq->free_head;
> +curr = id;
> +
> +size_t num = out_num + in_num;
> +
> +ok = vhost_svq_translate_addr(svq, sgs, out_sg, out_num);
> +if (unlikely(!ok)) {
> +return false;
> +}
> +
> +ok = vhost_svq_translate_addr(svq, sgs + out_num, in_sg, in_num);
> +if (unlikely(!ok)) {
> +return false;
> +}
> +

(sorry I missed this from the RFC v1) I think all of the above should
be in the caller, shouldn't it? It is duplicated with split.

Also, declarations should be at the beginning of blocks per QEMU
coding style [1].

The rest looks good to me by visual inspection.

> +/* Write descriptors to SVQ packed vring */
> +for (n = 0; n < num; n++) {
> +uint16_t flags = cpu_to_le16(svq->vring_packed.avail_used_flags |
> + (n < out_num ? 0 : VRING_DESC_F_WRITE) |
> + (n + 1 == num ? 0 : VRING_DESC_F_NEXT));
> +if (i == *head) {
> +head_flags = flags;
> +} else {
> +descs[i].flags = flags;
> +}
> +
> +descs[i].addr = cpu_to_le64(sgs[n]);
> +descs[i].id = id;
> +if (n < out_num) {
> +descs[i].len = cpu_to_le32(out_sg[n].iov_len);
> +} else {
> +descs[i].len = cpu_to_le32(in_sg[n - out_num].iov_len);
> +}
> +
> +curr = cpu_to_le16(svq->desc_next[curr]);
> +
> +if (++i >= svq->vring_packed.vring.num) {
> +i = 0;
> +svq->vring_packed.avail_used_flags ^=
> +1 << VRING_PACKED_DESC_F_AVAIL |
> +1 << VRING_PACKED_DESC_F_USED;
> +}
> +}
> +
> +if (i <= *head) {
> +svq->vring_packed.avail_wrap_counter ^= 1;
> +}
> +
> +svq->vring_packed.next_avail_idx = i;
> +svq->free_head = curr;
> +
> +/*
> + * A driver MUST NOT make the first descriptor in the list
> + * available before all subsequent descriptors comprising
> + * the list are made available.
> + */
> +smp_wmb();
> +svq->vring_packed.vring.desc[*head].flags = head_flags;
> +
> +return true;
> +}
> +
>  static void vhost_svq_kick(VhostShadowVirtqueue *svq)
>  {
>  bool needs_kick;
> @@ -258,7 +343,13 @@ int vhost_svq_add(VhostShadowVirtqueue *svq, const 
> struct iovec *out_sg,
>  return -ENOSPC;
>  }
>
> -ok = vhost_svq_add_split(svq, out_sg, out_num, in_sg, in_num, 
> &qemu_head);
> +if (virtio_vdev_has_feature(svq->vdev, VIRTIO_F_RING_PACKED)) {
> +ok = vhost_svq_add_packed(svq, out_sg, out_num,
> +  in_sg, in_num, &qemu_head);
> +} else {
> +ok = vhost_svq_add_split(svq, out_sg, out_num,
> + in_sg, in_num, &qemu_head);
> +}
>  if (unlikely(!ok)) {
>  return -EINVAL;
>  }
> --
> 2.45.2
>

[1] https://www.qemu.org/docs/master/devel/style.html#declarations




Re: [RFC v2 0/3] Add packed virtqueue to shadow virtqueue

2024-07-26 Thread Eugenio Perez Martin
On Fri, Jul 26, 2024 at 11:58 AM Sahil Siddiq  wrote:
>
> Hi,
>
> I have made some progress in this project and thought I would
> send these changes first before continuing. I split patch v1 [1]
> into two commits (#1 and #2) to make it easy to review. There are
> very few changes in the first commit. The second commit has no
> changes.
>
> There are a few things that I am not entirely sure of in commit #3.
>
> Q1.
> In virtio_ring.h [2], new aliases with memory alignment enforcement
> such as "vring_desc_t" have been created. I am not sure if this
> is required for the packed vq descriptor ring (vring_packed_desc)
> as well. I don't see a type alias that enforces memory alignment
> for "vring_packed_desc" in the linux kernel. I haven't used any
> alias either.
>

The alignment is required to be 16 for the descriptor ring and 4 for
the device and driver areas by the standard [1]. In QEMU, this is
solved by calling mmap, which always returns page-aligned addresses.

> Q2.
> I see that parts of the "vhost-vdpa" implementation are based on
> the assumption that SVQ uses the split vq format. For example,
> "vhost_vdpa_svq_map_rings" [3], calls "vhost_svq_device_area_size"
> which is specific to split vqs. The "vhost_vring_addr" [4] struct
> is also specific to split vqs.
>
> My idea is to have a generic "vhost_vring_addr" structure that
> wraps around split and packed vq specific structures, rather
> than using them directly in if-else conditions wherever the
> vhost-vdpa functions require their usage. However, this will
> involve checking their impact in several other places where this
> struct is currently being used (eg.: "vhost-user", "vhost-backend",
> "libvhost-user").
>

Ok I've just found this is under-documented actually :).

As you mention, vhost-user is already using this same struct for
packed vqs [2], just translating the driver area from the avail vring
and the device area from the used vring. So the best option is to
stick with that, unless I'm missing something.

> Is this approach alright or is there a better alternative? I would
> like to get your thoughts on this before working on this portion of
> the project.
>
> Thanks,
> Sahil
>

[1] https://docs.oasis-open.org/virtio/virtio/v1.3/virtio-v1.3.html
[2] 
https://github.com/DPDK/dpdk/blob/82c47f005b9a0a1e3a649664b7713443d18abe43/lib/vhost/vhost_user.c#L841C1-L841C25




Re: [PATCH v4 6/6] virtio: Add VIRTIO_F_IN_ORDER property definition

2024-07-22 Thread Eugenio Perez Martin
On Mon, Jul 22, 2024 at 1:11 PM Eugenio Perez Martin
 wrote:
>
> On Sat, Jul 20, 2024 at 9:16 PM Michael S. Tsirkin  wrote:
> >
> > On Wed, Jul 10, 2024 at 08:55:19AM -0400, Jonah Palmer wrote:
> > > Extend the virtio device property definitions to include the
> > > VIRTIO_F_IN_ORDER feature.
> > >
> > > The default state of this feature is disabled, allowing it to be
> > > explicitly enabled where it's supported.
> > >
> > > Acked-by: Eugenio Pérez 
> > > Signed-off-by: Jonah Palmer 
> >
> >
> > Given release is close, it's likely wise.
> > However, I think we should flip the default in the future
> > release.
> >
>
> Should we post a new version with v9.2 tag enabling it?
>

Sorry, actually I think this needs some more thought. Maybe in_order
hurts the performance of devices that are usually out of order, like
blk. Should we enable it only for virtio-net and let each device's code
decide?

> > > ---
> > >  include/hw/virtio/virtio.h | 4 +++-
> > >  1 file changed, 3 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/include/hw/virtio/virtio.h b/include/hw/virtio/virtio.h
> > > index fdc827f82e..d2a1938757 100644
> > > --- a/include/hw/virtio/virtio.h
> > > +++ b/include/hw/virtio/virtio.h
> > > @@ -373,7 +373,9 @@ typedef struct VirtIORNGConf VirtIORNGConf;
> > >  DEFINE_PROP_BIT64("packed", _state, _field, \
> > >VIRTIO_F_RING_PACKED, false), \
> > >  DEFINE_PROP_BIT64("queue_reset", _state, _field, \
> > > -  VIRTIO_F_RING_RESET, true)
> > > +  VIRTIO_F_RING_RESET, true), \
> > > +DEFINE_PROP_BIT64("in_order", _state, _field, \
> > > +  VIRTIO_F_IN_ORDER, false)
> > >
> > >  hwaddr virtio_queue_get_desc_addr(VirtIODevice *vdev, int n);
> > >  bool virtio_queue_enabled_legacy(VirtIODevice *vdev, int n);
> > > --
> > > 2.43.5
> >




Re: [PATCH v4 6/6] virtio: Add VIRTIO_F_IN_ORDER property definition

2024-07-22 Thread Eugenio Perez Martin
On Sat, Jul 20, 2024 at 9:16 PM Michael S. Tsirkin  wrote:
>
> On Wed, Jul 10, 2024 at 08:55:19AM -0400, Jonah Palmer wrote:
> > Extend the virtio device property definitions to include the
> > VIRTIO_F_IN_ORDER feature.
> >
> > The default state of this feature is disabled, allowing it to be
> > explicitly enabled where it's supported.
> >
> > Acked-by: Eugenio Pérez 
> > Signed-off-by: Jonah Palmer 
>
>
> Given release is close, it's likely wise.
> However, I think we should flip the default in the future
> release.
>

Should we post a new version with v9.2 tag enabling it?

> > ---
> >  include/hw/virtio/virtio.h | 4 +++-
> >  1 file changed, 3 insertions(+), 1 deletion(-)
> >
> > diff --git a/include/hw/virtio/virtio.h b/include/hw/virtio/virtio.h
> > index fdc827f82e..d2a1938757 100644
> > --- a/include/hw/virtio/virtio.h
> > +++ b/include/hw/virtio/virtio.h
> > @@ -373,7 +373,9 @@ typedef struct VirtIORNGConf VirtIORNGConf;
> >  DEFINE_PROP_BIT64("packed", _state, _field, \
> >VIRTIO_F_RING_PACKED, false), \
> >  DEFINE_PROP_BIT64("queue_reset", _state, _field, \
> > -  VIRTIO_F_RING_RESET, true)
> > +  VIRTIO_F_RING_RESET, true), \
> > +DEFINE_PROP_BIT64("in_order", _state, _field, \
> > +  VIRTIO_F_IN_ORDER, false)
> >
> >  hwaddr virtio_queue_get_desc_addr(VirtIODevice *vdev, int n);
> >  bool virtio_queue_enabled_legacy(VirtIODevice *vdev, int n);
> > --
> > 2.43.5
>




Re: [PATCH] hw/virtio/vdpa-dev: Check returned value instead of dereferencing @errp

2024-07-16 Thread Eugenio Perez Martin
On Tue, Jul 16, 2024 at 5:05 AM Zhao Liu  wrote:
>
> On Mon, Jul 15, 2024 at 11:01:08PM +0200, Eugenio Perez Martin wrote:
> > Date: Mon, 15 Jul 2024 23:01:08 +0200
> > From: Eugenio Perez Martin 
> > Subject: Re: [PATCH] hw/virtio/vdpa-dev: Check returned value instead of
> >  dereferencing @errp
> >
> > On Mon, Jul 15, 2024 at 11:45 AM Zhao Liu  wrote:
> > >
> > > As the comment in qapi/error, dereferencing @errp requires
> > > ERRP_GUARD():
> > >
> > > * = Why, when and how to use ERRP_GUARD() =
> > > *
> > > * Without ERRP_GUARD(), use of the @errp parameter is restricted:
> > > * - It must not be dereferenced, because it may be null.
> > > ...
> > > * ERRP_GUARD() lifts these restrictions.
> > > *
> > > * To use ERRP_GUARD(), add it right at the beginning of the function.
> > > * @errp can then be used without worrying about the argument being
> > > * NULL or &error_fatal.
> > > *
> > > * Using it when it's not needed is safe, but please avoid cluttering
> > > * the source with useless code.
> > >
> > > Though vhost_vdpa_device_realize() is called at DeviceClass.realize()
> > > context and won't get NULL @errp, it's still better to follow the
> > > requirement to add the ERRP_GUARD().
> > >
> > > But qemu_open() and vhost_vdpa_device_get_u32()'s return values can
> > > distinguish between successful and unsuccessful calls, so check the
> > > return values directly without dereferencing @errp, which eliminates
> > > the need of ERRP_GUARD().
> > >
> > > Cc: "Michael S. Tsirkin" 
> > > Cc: "Eugenio Pérez" 
> > > Cc: Jason Wang 
> > > Signed-off-by: Zhao Liu 
> > > ---
> > >  hw/virtio/vdpa-dev.c | 11 ++-
> > >  1 file changed, 6 insertions(+), 5 deletions(-)
> > >
> > > diff --git a/hw/virtio/vdpa-dev.c b/hw/virtio/vdpa-dev.c
> > > index 64b96b226c39..7b439efdc1d3 100644
> > > --- a/hw/virtio/vdpa-dev.c
> > > +++ b/hw/virtio/vdpa-dev.c
> > > @@ -50,6 +50,7 @@ vhost_vdpa_device_get_u32(int fd, unsigned long int 
> > > cmd, Error **errp)
> > >
> > >  static void vhost_vdpa_device_realize(DeviceState *dev, Error **errp)
> > >  {
> > > +ERRP_GUARD();
> >
> > Good catch, thank you! But removing the err dereferencing eliminates
> > the need for ERRP_GUARD(), doesn't it?
> >
>
> Thanks Eugenio! You're right and I forgot to delete it. I'll post a new
> version.
>
>

Good! With that removed,

Acked-by: Eugenio Pérez 

Thanks!




Re: [PATCH] hw/virtio/vdpa-dev: Check returned value instead of dereferencing @errp

2024-07-15 Thread Eugenio Perez Martin
On Mon, Jul 15, 2024 at 11:45 AM Zhao Liu  wrote:
>
> As the comment in qapi/error, dereferencing @errp requires
> ERRP_GUARD():
>
> * = Why, when and how to use ERRP_GUARD() =
> *
> * Without ERRP_GUARD(), use of the @errp parameter is restricted:
> * - It must not be dereferenced, because it may be null.
> ...
> * ERRP_GUARD() lifts these restrictions.
> *
> * To use ERRP_GUARD(), add it right at the beginning of the function.
> * @errp can then be used without worrying about the argument being
> * NULL or &error_fatal.
> *
> * Using it when it's not needed is safe, but please avoid cluttering
> * the source with useless code.
>
> Though vhost_vdpa_device_realize() is called at DeviceClass.realize()
> context and won't get NULL @errp, it's still better to follow the
> requirement to add the ERRP_GUARD().
>
> But qemu_open() and vhost_vdpa_device_get_u32()'s return values can
> distinguish between successful and unsuccessful calls, so check the
> return values directly without dereferencing @errp, which eliminates
> the need of ERRP_GUARD().
>
> Cc: "Michael S. Tsirkin" 
> Cc: "Eugenio Pérez" 
> Cc: Jason Wang 
> Signed-off-by: Zhao Liu 
> ---
>  hw/virtio/vdpa-dev.c | 11 ++-
>  1 file changed, 6 insertions(+), 5 deletions(-)
>
> diff --git a/hw/virtio/vdpa-dev.c b/hw/virtio/vdpa-dev.c
> index 64b96b226c39..7b439efdc1d3 100644
> --- a/hw/virtio/vdpa-dev.c
> +++ b/hw/virtio/vdpa-dev.c
> @@ -50,6 +50,7 @@ vhost_vdpa_device_get_u32(int fd, unsigned long int cmd, 
> Error **errp)
>
>  static void vhost_vdpa_device_realize(DeviceState *dev, Error **errp)
>  {
> +ERRP_GUARD();

Good catch, thank you! But removing the err dereferencing eliminates
the need for ERRP_GUARD(), doesn't it?

Thanks!

>  VirtIODevice *vdev = VIRTIO_DEVICE(dev);
>  VhostVdpaDevice *v = VHOST_VDPA_DEVICE(vdev);
>  struct vhost_vdpa_iova_range iova_range;
> @@ -63,19 +64,19 @@ static void vhost_vdpa_device_realize(DeviceState *dev, 
> Error **errp)
>  }
>
>  v->vhostfd = qemu_open(v->vhostdev, O_RDWR, errp);
> -if (*errp) {
> +if (v->vhostfd < 0) {
>  return;
>  }
>
>  v->vdev_id = vhost_vdpa_device_get_u32(v->vhostfd,
> VHOST_VDPA_GET_DEVICE_ID, errp);
> -if (*errp) {
> +if (v->vdev_id < 0) {
>  goto out;
>  }
>
>  max_queue_size = vhost_vdpa_device_get_u32(v->vhostfd,
> VHOST_VDPA_GET_VRING_NUM, 
> errp);
> -if (*errp) {
> +if (max_queue_size < 0) {
>  goto out;
>  }
>
> @@ -89,7 +90,7 @@ static void vhost_vdpa_device_realize(DeviceState *dev, 
> Error **errp)
>
>  v->num_queues = vhost_vdpa_device_get_u32(v->vhostfd,
>VHOST_VDPA_GET_VQS_COUNT, 
> errp);
> -if (*errp) {
> +if (v->num_queues < 0) {
>  goto out;
>  }
>
> @@ -127,7 +128,7 @@ static void vhost_vdpa_device_realize(DeviceState *dev, 
> Error **errp)
>  v->config_size = vhost_vdpa_device_get_u32(v->vhostfd,
> VHOST_VDPA_GET_CONFIG_SIZE,
> errp);
> -if (*errp) {
> +if (v->config_size < 0) {
>  goto vhost_cleanup;
>  }
>
> --
> 2.34.1
>




Re: [PATCH v4 4/6] virtio: virtqueue_ordered_flush - VIRTIO_F_IN_ORDER support

2024-07-10 Thread Eugenio Perez Martin
On Wed, Jul 10, 2024 at 2:56 PM Jonah Palmer  wrote:
>
> Add VIRTIO_F_IN_ORDER feature support for the virtqueue_flush operation.
>
> The goal of the virtqueue_ordered_flush operation when the
> VIRTIO_F_IN_ORDER feature has been negotiated is to write elements to
> the used/descriptor ring in-order and then update used_idx.
>
> The function iterates through the VirtQueueElement used_elems array
> in-order starting at vq->used_idx. If the element is valid (filled), the
> element is written to the used/descriptor ring. This process continues
> until we find an invalid (not filled) element.
>
> For packed VQs, the first entry (at vq->used_idx) is written to the
> descriptor ring last so the guest doesn't see any invalid descriptors.
>
> If any elements were written, the used_idx is updated.
>
> Signed-off-by: Jonah Palmer 

Acked-by: Eugenio Pérez 

> ---
> Several fixes here for the split VQ case:
> - Ensure all previous write operations to buffers are completed before
>   updating the used_idx (via smp_wmb()).
>
> - used_elems index 'i' should be incremented by the number of descriptors
>   in the current element we just processed, not by the running total of
>   descriptors already seen. This would've caused batched operations to
>   miss ordered elements when looping through the used_elems array.
>
> - Do not keep the VQ's used_idx bound between 0 and vring.num-1 when
>   setting it via vring_used_idx_set().
>
>   While the packed VQ case naturally keeps used_idx bound between 0 and
>   vring.num-1, the split VQ case cannot. This is because used_idx is
>   used to compare the current event index with the new and old used
>   indices to decide if a notification is necessary (see
>   virtio_split_should_notify()). This comparison expects used_idx to be
>   between 0 and 65535, not 0 and vring.num-1.
>
>  hw/virtio/virtio.c | 70 +-
>  1 file changed, 69 insertions(+), 1 deletion(-)
>
> diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
> index a7b41c..b419d8d6e7 100644
> --- a/hw/virtio/virtio.c
> +++ b/hw/virtio/virtio.c
> @@ -1023,6 +1023,72 @@ static void virtqueue_packed_flush(VirtQueue *vq, 
> unsigned int count)
>  }
>  }
>
> +static void virtqueue_ordered_flush(VirtQueue *vq)
> +{
> +unsigned int i = vq->used_idx % vq->vring.num;
> +unsigned int ndescs = 0;
> +uint16_t old = vq->used_idx;
> +uint16_t new;
> +bool packed;
> +VRingUsedElem uelem;
> +
> +packed = virtio_vdev_has_feature(vq->vdev, VIRTIO_F_RING_PACKED);
> +
> +if (packed) {
> +if (unlikely(!vq->vring.desc)) {
> +return;
> +}
> +} else if (unlikely(!vq->vring.used)) {
> +return;
> +}
> +
> +/* First expected in-order element isn't ready, nothing to do */
> +if (!vq->used_elems[i].in_order_filled) {
> +return;
> +}
> +
> +/* Search for filled elements in-order */
> +while (vq->used_elems[i].in_order_filled) {
> +/*
> + * First entry for packed VQs is written last so the guest
> + * doesn't see invalid descriptors.
> + */
> +if (packed && i != vq->used_idx) {
> +virtqueue_packed_fill_desc(vq, &vq->used_elems[i], ndescs, 
> false);
> +} else if (!packed) {
> +uelem.id = vq->used_elems[i].index;
> +uelem.len = vq->used_elems[i].len;
> +vring_used_write(vq, &uelem, i);
> +}
> +
> +vq->used_elems[i].in_order_filled = false;
> +ndescs += vq->used_elems[i].ndescs;
> +i += vq->used_elems[i].ndescs;
> +if (i >= vq->vring.num) {
> +i -= vq->vring.num;
> +}
> +}
> +
> +if (packed) {
> +virtqueue_packed_fill_desc(vq, &vq->used_elems[vq->used_idx], 0, 
> true);
> +vq->used_idx += ndescs;
> +if (vq->used_idx >= vq->vring.num) {
> +vq->used_idx -= vq->vring.num;
> +vq->used_wrap_counter ^= 1;
> +vq->signalled_used_valid = false;
> +}
> +} else {
> +/* Make sure buffer is written before we update index. */
> +smp_wmb();
> +new = old + ndescs;
> +vring_used_idx_set(vq, new);
> +if (unlikely((int16_t)(new - vq->signalled_used) < (uint16_t)(new - 
> old))) {
> +vq->signalled_used_valid = false;
> +}
> +}
> +vq->inuse -= ndescs;
> +}
> +
>  void virtqueue_flush(VirtQueue *vq, unsigned int count)
>  {
>  if (virtio_device_disabled(vq->vdev)) {
> @@ -1030,7 +1096,9 @@ void virtqueue_flush(VirtQueue *vq, unsigned int count)
>  return;
>  }
>
> -if (virtio_vdev_has_feature(vq->vdev, VIRTIO_F_RING_PACKED)) {
> +if (virtio_vdev_has_feature(vq->vdev, VIRTIO_F_IN_ORDER)) {
> +virtqueue_ordered_flush(vq);
> +} else if (virtio_vdev_has_feature(vq->vdev, VIRTIO_F_RING_PACKED)) {
>  virtqueue_packed_flush(vq, count);
>  } else {
>  virtqueue_split_flush(vq, count
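The flush path quoted above is easier to reason about in isolation. Below is a minimal, self-contained sketch of the same idea (toy types and names, split-ring case only, no QEMU APIs): completions are buffered in used_elems, and the flush publishes the longest in-order prefix starting at used_idx, stopping at the first hole.

```c
#include <assert.h>
#include <stdbool.h>

#define RING_NUM 8

/* Toy stand-ins for the VirtQueue state (hypothetical, not QEMU's types). */
struct toy_elem { unsigned ndescs; bool in_order_filled; };
struct toy_vq {
    struct toy_elem used_elems[RING_NUM];
    unsigned used_idx;   /* next in-order slot to publish */
    unsigned inuse;
};

/* Analogue of virtqueue_ordered_fill(): mark the element at its
 * in-order slot as completed. */
static void toy_fill(struct toy_vq *vq, unsigned slot, unsigned ndescs)
{
    vq->used_elems[slot].ndescs = ndescs;
    vq->used_elems[slot].in_order_filled = true;
}

/* Analogue of virtqueue_ordered_flush() (split-ring case): publish
 * contiguously filled elements and stop at the first unfilled one.
 * Returns the number of descriptors flushed. */
static unsigned toy_flush(struct toy_vq *vq)
{
    unsigned i = vq->used_idx % RING_NUM, ndescs = 0;

    while (vq->used_elems[i].in_order_filled) {
        vq->used_elems[i].in_order_filled = false;
        ndescs += vq->used_elems[i].ndescs;
        i = (i + vq->used_elems[i].ndescs) % RING_NUM;
    }
    vq->used_idx = (vq->used_idx + ndescs) % RING_NUM;
    vq->inuse -= ndescs;
    return ndescs;
}
```

Note that, as in the real code, an out-of-order completion is buffered and produces no visible progress until the head element is also filled.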

Re: [PATCH] virtio: remove virtio_tswap16s() call in vring_packed_event_read()

2024-07-01 Thread Eugenio Perez Martin
On Mon, Jul 1, 2024 at 9:52 AM Stefano Garzarella  wrote:
>
> Commit d152cdd6f6 ("virtio: use virtio accessor to access packed event")
> switched from address_space_read_cached() to virtio_lduw_phys_cached()
> to access the packed descriptor event.
>
> When we used address_space_read_cached(), we needed to call
> virtio_tswap16s() to handle the endianness of the field, but
> virtio_lduw_phys_cached() already handles it internally, so we no longer
> need to call virtio_tswap16s() (as the commit did for `off_wrap`,
> but forgot for `flags`).
>
> Fixes: d152cdd6f6 ("virtio: use virtio accessor to access packed event")
> Cc: jasow...@redhat.com
> Cc: qemu-sta...@nongnu.org
> Reported-by: Xoykie 
> Link: 
> https://lore.kernel.org/qemu-devel/cafu8rb_pjr77zmlsm0unf9xpnxfr_--tjr49f_ex32zbc5o...@mail.gmail.com
> Signed-off-by: Stefano Garzarella 

Reviewed-by: Eugenio Pérez 

I think it would be great to test the patches using a big endian host
just in case.
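For reference, the failure mode on a big-endian host can be shown with plain C (the names below are illustrative stand-ins, not QEMU's helpers): the accessor already converts the little-endian field to host order, so a second swap just restores the wrong byte order.

```c
#include <assert.h>
#include <stdint.h>

/* Byte-swap helper standing in for what virtio_tswap16s() does on a
 * big-endian host with a little-endian device. */
static uint16_t toy_bswap16(uint16_t v)
{
    return (uint16_t)((v << 8) | (v >> 8));
}

/* What the accessor does on a big-endian host: LE wire value -> host order. */
static uint16_t toy_lduw_le(uint16_t wire_le)
{
    return toy_bswap16(wire_le);
}
```

On a little-endian host both swaps are no-ops, which is why the bug only shows up on big-endian machines.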

Thanks!

> ---
>  hw/virtio/virtio.c | 1 -
>  1 file changed, 1 deletion(-)
>
> diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
> index 893a072c9d..2e5e67bdb9 100644
> --- a/hw/virtio/virtio.c
> +++ b/hw/virtio/virtio.c
> @@ -323,7 +323,6 @@ static void vring_packed_event_read(VirtIODevice *vdev,
>  /* Make sure flags is seen before off_wrap */
>  smp_rmb();
>  e->off_wrap = virtio_lduw_phys_cached(vdev, cache, off_off);
> -virtio_tswap16s(vdev, &e->flags);
>  }
>
>  static void vring_packed_off_wrap_write(VirtIODevice *vdev,
> --
> 2.45.2
>




Re: [RFC] vhost: Introduce packed vq and add buffer elements

2024-06-24 Thread Eugenio Perez Martin
On Sat, Jun 22, 2024 at 6:34 AM Sahil  wrote:
>
> Hi,
>
> On Wednesday, June 19, 2024 3:49:29 PM GMT+5:30 Eugenio Perez Martin wrote:
> > [...]
> > Hi Sahil,
> >
> > Just some nitpicks here and there,
> >
> > > [1] https://wiki.qemu.org/Internships/ProjectIdeas/PackedShadowVirtqueue
> > >
> > >  hw/virtio/vhost-shadow-virtqueue.c | 124 -
> > >  hw/virtio/vhost-shadow-virtqueue.h |  66 ++-
> > >  2 files changed, 167 insertions(+), 23 deletions(-)
> > >
> > > diff --git a/hw/virtio/vhost-shadow-virtqueue.c
> > > b/hw/virtio/vhost-shadow-virtqueue.c index fc5f408f77..e3b276a9e9 100644
> > > --- a/hw/virtio/vhost-shadow-virtqueue.c
> > > +++ b/hw/virtio/vhost-shadow-virtqueue.c
> > > @@ -217,6 +217,122 @@ static bool 
> > > vhost_svq_add_split(VhostShadowVirtqueue *svq,
> > >  return true;
> > >  }
> > >
> > > +/**
> > > + * Write descriptors to SVQ packed vring
> > > + *
> > > + * @svq: The shadow virtqueue
> > > + * @sg: Cache for hwaddr
> > > + * @out_sg: The iovec from the guest that is read-only for device
> > > + * @out_num: iovec length
> > > + * @in_sg: The iovec from the guest that is write-only for device
> > > + * @in_num: iovec length
> > > + * @head_flags: flags for first descriptor in list
> > > + *
> > > + * Return true if success, false otherwise and print error.
> > > + */
> > > +static bool vhost_svq_vring_write_descs_packed(VhostShadowVirtqueue 
> > > *svq, hwaddr *sg,
> > > +const struct iovec *out_sg, 
> > > size_t out_num,
> > > +const struct iovec *in_sg, 
> > > size_t in_num,
> > > +uint16_t *head_flags)
> > > +{
> > > +uint16_t id, curr, head, i;
> > > +unsigned n;
> > > +struct vring_packed_desc *descs = svq->vring_packed.vring.desc;
> > > +bool ok;
> > > +
> > > +head = svq->vring_packed.next_avail_idx;
> > > +i = head;
> > > +id = svq->free_head;
> > > +curr = id;
> > > +
> > > +size_t num = out_num + in_num;
> > > +
> > > +if (num == 0) {
> > > +return true;
> > > +}
> >
> > num == 0 is impossible now, the caller checks for that.
>
> Oh yes, I missed that.
>
> >
> > > +
> > > +ok = vhost_svq_translate_addr(svq, sg, out_sg, out_num);
> > > +if (unlikely(!ok)) {
> > > +return false;
> > > +}
> > > +
> > > +ok = vhost_svq_translate_addr(svq, sg + out_num, in_sg, in_num);
> > > +if (unlikely(!ok)) {
> > > +return false;
> > > +}
> > > +
> > > +for (n = 0; n < num; n++) {
> > > +uint16_t flags = cpu_to_le16(svq->vring_packed.avail_used_flags |
> > > +(n < out_num ? 0 : VRING_DESC_F_WRITE) |
> > > +(n + 1 == num ? 0 : VRING_DESC_F_NEXT));
> > > +if (i == head) {
> > > +*head_flags = flags;
> > > +} else {
> > > +descs[i].flags = flags;
> > > +}
> > > +
> > > +descs[i].addr = cpu_to_le64(sg[n]);
> > > +descs[i].id = id;
> > > +if (n < out_num) {
> > > +descs[i].len = cpu_to_le32(out_sg[n].iov_len);
> > > +} else {
> > > +descs[i].len = cpu_to_le32(in_sg[n - out_num].iov_len);
> > > +}
> > > +
> > > +curr = cpu_to_le16(svq->desc_next[curr]);
> > > +
> > > +if (++i >= svq->vring_packed.vring.num) {
> > > +i = 0;
> > > +svq->vring_packed.avail_used_flags ^=
> > > +1 << VRING_PACKED_DESC_F_AVAIL |
> > > +1 << VRING_PACKED_DESC_F_USED;
> > > +}
> > > +}
> > > +
> > > +if (i <= head) {
> > > +svq->vring_packed.avail_wrap_counter ^= 1;
> > > +}
> > > +
> > > +svq->vring_packed.next_avail_idx = i;
> > > +svq->free_head = curr;
> > > +return true;
> > > +}
> > > +
> > > +static bool vhost_svq_add_packed(VhostShadowVirtqueue *svq,
> > > + 

Re: [PATCH v3 0/6] virtio,vhost: Add VIRTIO_F_IN_ORDER support

2024-06-20 Thread Eugenio Perez Martin
On Thu, Jun 20, 2024 at 7:56 PM Jonah Palmer  wrote:
>
> The goal of these patches is to add support to a variety of virtio and
> vhost devices for the VIRTIO_F_IN_ORDER transport feature. This feature
> indicates that all buffers are used by the device in the same order in
> which they were made available by the driver.
>
> These patches attempt to implement a generalized, non-device-specific
> solution to support this feature.
>
> The core feature behind this solution is a buffer mechanism in the form
> of a VirtQueue's used_elems VirtQueueElement array. This allows devices
> who always use buffers in-order by default to have a minimal overhead
> impact. Devices that may not always use buffers in-order likely will
> experience a performance hit. How large that performance hit is will
> depend on how frequently elements are completed out-of-order.
>
> A VirtQueue whose device uses this feature will use its used_elems
> VirtQueueElement array to hold used VirtQueueElements. The index that
> used elements are placed in used_elems is the same index on the
> used/descriptor ring that would satisfy the in-order requirement. In
> other words, used elements are placed in their in-order locations on
> used_elems and are only written to the used/descriptor ring once the
> elements on used_elems are able to continue their expected order.
>
> To differentiate between a "used" and "unused" element on the used_elems
> array (a "used" element being an element that has returned from
> processing and an "unused" element being an element that has not yet
> been processed), we added a boolean 'in_order_filled' member to the
> VirtQueueElement struct. This flag is set to true when the element comes
> back from processing (virtqueue_ordered_fill) and then set back to false
> once it's been written to the used/descriptor ring
> (virtqueue_ordered_flush).
>
> Testing:
> 
> Testing was done using the dpdk-testpmd application on both the host and
> guest using the following configurations. Traffic was generated between
> the host and guest after running 'start tx_first' on both the host and
> guest dpdk-testpmd applications. Results are below after traffic was
> generated for several seconds.
>
> Relevant Qemu args:
> ---
> -chardev socket,id=char1,path=/tmp/vhost-user1,server=off
> -chardev socket,id=char2,path=/tmp/vhost-user2,server=off
> -netdev type=vhost-user,id=net1,chardev=char1,vhostforce=on,queues=1
> -netdev type=vhost-user,id=net2,chardev=char2,vhostforce=on,queues=1
> -device virtio-net-pci,in_order=true,packed=true,netdev=net1,
> mac=56:48:4f:53:54:00,mq=on,vectors=4,rx_queue_size=256
> -device virtio-net-pci,in_order=true,packed=true,netdev=net2,
> mac=56:48:4f:53:54:01,mq=on,vectors=4,rx_queue_size=256
>

Hi Jonah,

These tests are great, but others should also be performed. In
particular, QEMU should run ok with "tap" netdev with vhost=off
instead of vhost-user:

-netdev type=tap,id=net1,vhost=off
-netdev type=tap,id=net2,vhost=off

This way, packets are going through the modified code. With this
configuration, QEMU is the one forwarding the packets so testpmd is
not needed in the host. It's still needed in the guest, as the Linux guest
driver does not support in_order. The guest kernel cmdline and testpmd
cmdline should require no changes from the configuration you describe
here.

And then try with in_order=true,packed=false and
in_order=true,packed=off in corresponding virtio-net-pci.

Performance comparison between in_order=true and in_order=false is
also interesting but we're not batching so I don't think we will get
an extreme improvement.

Does the plan work for you?

Thanks!

> Host dpdk-testpmd command:
> --
> dpdk-testpmd -l 0,2,3,4,5 --socket-mem=1024 -n 4
> --vdev 'net_vhost0,iface=/tmp/vhost-user1'
> --vdev 'net_vhost1,iface=/tmp/vhost-user2' --
> --portmask=f -i --rxq=1 --txq=1 --nb-cores=4 --forward-mode=io
>
> Guest dpdk-testpmd command:
> ---
> dpdk-testpmd -l 0,1 -a :00:02.0 -a :00:03.0 -- --portmask=3
> --rxq=1 --txq=1 --nb-cores=1 --forward-mode=io -i
>
> Results:
> 
> +++ Accumulated forward statistics for all ports+++
> RX-packets: 79067488   RX-dropped: 0 RX-total: 79067488
> TX-packets: 79067552   TX-dropped: 0 TX-total: 79067552
> 
>
> ---
> v3: Drop Tested-by tags until patches are re-tested.
> Replace 'prev_avail_idx' with 'vq->last_avail_idx - 1' in
> virtqueue_split_pop.
> Remove redundant '+vq->vring.num' in 'max_steps' calculation in
> virtqueue_ordered_fill.
> Add test results to CV.
>
> v2: Make 'in_order_filled' more descriptive.
> Change 'j' to more descriptive var name in virtqueue_split_pop.
> Use more definitive search conditional in virtqueue_ordered_fill.
> Avoid code duplication in virtqueue_ordered

Re: [RFC] vhost: Introduce packed vq and add buffer elements

2024-06-19 Thread Eugenio Perez Martin
On Tue, Jun 18, 2024 at 8:19 PM Sahil Siddiq  wrote:
>
> This is the first patch in a series to add support for packed
> virtqueues in vhost_shadow_virtqueue. This patch implements the
> insertion of available buffers in the descriptor area. It takes
> into account descriptor chains, but does not consider indirect
> descriptors.
>
> VhostShadowVirtqueue has also been modified so it acts as a layer
> of abstraction for split and packed virtqueues.
>
> Signed-off-by: Sahil Siddiq 
> ---
> Hi,
>
> I am currently working on adding support for packed virtqueues in
> vhost_shadow_virtqueue [1]. This patch only implements the insertion of
> available buffers in the descriptor area. It does not take into
> account indirect descriptors, event_idx or notifications.
>
> I don't think these changes are testable yet but I thought I would
> still post this patch for feedback. The following email annotates these
> changes with a few comments and questions that I have.
>
> Thanks,
> Sahil
>

Hi Sahil,

Just some nitpicks here and there,

> [1] https://wiki.qemu.org/Internships/ProjectIdeas/PackedShadowVirtqueue
>
>  hw/virtio/vhost-shadow-virtqueue.c | 124 -
>  hw/virtio/vhost-shadow-virtqueue.h |  66 ++-
>  2 files changed, 167 insertions(+), 23 deletions(-)
>
> diff --git a/hw/virtio/vhost-shadow-virtqueue.c 
> b/hw/virtio/vhost-shadow-virtqueue.c
> index fc5f408f77..e3b276a9e9 100644
> --- a/hw/virtio/vhost-shadow-virtqueue.c
> +++ b/hw/virtio/vhost-shadow-virtqueue.c
> @@ -217,6 +217,122 @@ static bool vhost_svq_add_split(VhostShadowVirtqueue 
> *svq,
>  return true;
>  }
>
> +/**
> + * Write descriptors to SVQ packed vring
> + *
> + * @svq: The shadow virtqueue
> + * @sg: Cache for hwaddr
> + * @out_sg: The iovec from the guest that is read-only for device
> + * @out_num: iovec length
> + * @in_sg: The iovec from the guest that is write-only for device
> + * @in_num: iovec length
> + * @head_flags: flags for first descriptor in list
> + *
> + * Return true if success, false otherwise and print error.
> + */
> +static bool vhost_svq_vring_write_descs_packed(VhostShadowVirtqueue *svq, 
> hwaddr *sg,
> +const struct iovec *out_sg, size_t 
> out_num,
> +const struct iovec *in_sg, size_t 
> in_num,
> +uint16_t *head_flags)
> +{
> +uint16_t id, curr, head, i;
> +unsigned n;
> +struct vring_packed_desc *descs = svq->vring_packed.vring.desc;
> +bool ok;
> +
> +head = svq->vring_packed.next_avail_idx;
> +i = head;
> +id = svq->free_head;
> +curr = id;
> +
> +size_t num = out_num + in_num;
> +
> +if (num == 0) {
> +return true;
> +}

num == 0 is impossible now, the caller checks for that.

> +
> +ok = vhost_svq_translate_addr(svq, sg, out_sg, out_num);
> +if (unlikely(!ok)) {
> +return false;
> +}
> +
> +ok = vhost_svq_translate_addr(svq, sg + out_num, in_sg, in_num);
> +if (unlikely(!ok)) {
> +return false;
> +}
> +
> +for (n = 0; n < num; n++) {
> +uint16_t flags = cpu_to_le16(svq->vring_packed.avail_used_flags |
> +(n < out_num ? 0 : VRING_DESC_F_WRITE) |
> +(n + 1 == num ? 0 : VRING_DESC_F_NEXT));
> +if (i == head) {
> +*head_flags = flags;
> +} else {
> +descs[i].flags = flags;
> +}
> +
> +descs[i].addr = cpu_to_le64(sg[n]);
> +descs[i].id = id;
> +if (n < out_num) {
> +descs[i].len = cpu_to_le32(out_sg[n].iov_len);
> +} else {
> +descs[i].len = cpu_to_le32(in_sg[n - out_num].iov_len);
> +}
> +
> +curr = cpu_to_le16(svq->desc_next[curr]);
> +
> +if (++i >= svq->vring_packed.vring.num) {
> +i = 0;
> +svq->vring_packed.avail_used_flags ^=
> +1 << VRING_PACKED_DESC_F_AVAIL |
> +1 << VRING_PACKED_DESC_F_USED;
> +}
> +}
> +
> +if (i <= head) {
> +svq->vring_packed.avail_wrap_counter ^= 1;
> +}
> +
> +svq->vring_packed.next_avail_idx = i;
> +svq->free_head = curr;
> +return true;
> +}
> +
> +static bool vhost_svq_add_packed(VhostShadowVirtqueue *svq,
> +const struct iovec *out_sg, size_t out_num,
> +const struct iovec *in_sg, size_t in_num,
> +unsigned *head)
> +{
> +bool ok;
> +uint16_t head_flags = 0;
> +g_autofree hwaddr *sgs = g_new(hwaddr, out_num + in_num);
> +
> +*head = svq->vring_packed.next_avail_idx;
> +
> +/* We need some descriptors here */
> +if (unlikely(!out_num && !in_num)) {
> +qemu_log_mask(LOG_GUEST_ERROR,
> +  "Guest provided element with no descriptors");
> +return false;
> +}
> +
> +ok = vhost_svq_vring_write_descs_packed(

Re: [RFC] vhost: Introduce packed vq and add buffer elements

2024-06-19 Thread Eugenio Perez Martin
On Tue, Jun 18, 2024 at 8:58 PM Sahil  wrote:
>
> Hi,
>
> On Tuesday, June 18, 2024 11:48:34 PM GMT+5:30 Sahil Siddiq wrote:
> > [...]
> >
> >  hw/virtio/vhost-shadow-virtqueue.c | 124 -
> >  hw/virtio/vhost-shadow-virtqueue.h |  66 ++-
> >  2 files changed, 167 insertions(+), 23 deletions(-)
> >
> > diff --git a/hw/virtio/vhost-shadow-virtqueue.c 
> > b/hw/virtio/vhost-shadow-virtqueue.c
> > index fc5f408f77..e3b276a9e9 100644
> > --- a/hw/virtio/vhost-shadow-virtqueue.c
> > +++ b/hw/virtio/vhost-shadow-virtqueue.c
> > @@ -217,6 +217,122 @@ static bool vhost_svq_add_split(VhostShadowVirtqueue 
> > *svq,
> >  return true;
> >  }
> >
> > +/**
> > + * Write descriptors to SVQ packed vring
> > + *
> > + * @svq: The shadow virtqueue
> > + * @sg: Cache for hwaddr
> > + * @out_sg: The iovec from the guest that is read-only for device
> > + * @out_num: iovec length
> > + * @in_sg: The iovec from the guest that is write-only for device
> > + * @in_num: iovec length
> > + * @head_flags: flags for first descriptor in list
> > + *
> > + * Return true if success, false otherwise and print error.
> > + */
> > +static bool vhost_svq_vring_write_descs_packed(VhostShadowVirtqueue *svq, 
> > hwaddr *sg,
> > +const struct iovec *out_sg, size_t 
> > out_num,
> > +const struct iovec *in_sg, size_t 
> > in_num,
> > +uint16_t *head_flags)
> > +{
> > +uint16_t id, curr, head, i;
> > +unsigned n;
> > +struct vring_packed_desc *descs = svq->vring_packed.vring.desc;
> > +bool ok;
> > +
> > +head = svq->vring_packed.next_avail_idx;
> > +i = head;
> > +id = svq->free_head;
> > +curr = id;
> > +
> > +size_t num = out_num + in_num;
> > +
> > +if (num == 0) {
> > +return true;
> > +}
> > +
> > +ok = vhost_svq_translate_addr(svq, sg, out_sg, out_num);
> > +if (unlikely(!ok)) {
> > +return false;
> > +}
> > +
> > +ok = vhost_svq_translate_addr(svq, sg + out_num, in_sg, in_num);
> > +if (unlikely(!ok)) {
> > +return false;
> > +}
> > +
> > +for (n = 0; n < num; n++) {
> > +uint16_t flags = cpu_to_le16(svq->vring_packed.avail_used_flags |
> > +(n < out_num ? 0 : VRING_DESC_F_WRITE) |
> > +(n + 1 == num ? 0 : VRING_DESC_F_NEXT));
> > +if (i == head) {
> > +*head_flags = flags;
> > +} else {
> > +descs[i].flags = flags;
> > +}
> > +
> > +descs[i].addr = cpu_to_le64(sg[n]);
> > +descs[i].id = id;
> > +if (n < out_num) {
> > +descs[i].len = cpu_to_le32(out_sg[n].iov_len);
> > +} else {
> > +descs[i].len = cpu_to_le32(in_sg[n - out_num].iov_len);
> > +}
> > +
> > +curr = cpu_to_le16(svq->desc_next[curr]);
>
> "curr" is being updated here, but descs[i].id is always set to id which 
> doesn't change in
> the loop. So all the descriptors in the chain will have the same id. I can't 
> find anything
> in the virtio specification [1] that suggests that all descriptors in the 
> chain have the same
> id. Also, going by the figure captioned "Three chained descriptors available" 
> in the blog
> post on packed virtqueues [2], it looks like the descriptors in the chain 
> have different
> buffer ids.
>
> The virtio implementation in Linux also reuses the same id value for all the 
> descriptors in a
> single chain. I am not sure if I am missing something here.
>

The code is right, the id that identifies the whole chain is just the
one on the last descriptor. The key is that all the tail descriptors
of the chains will have a different id, the rest ids are ignored so it
is easier this way. I got it wrong in a recent mail in the list, where
you can find more information. Let me know if you cannot find it :).

In the split vq is different as a chained descriptor can go back and
forth in the descriptor ring with the next id. So all of them must be
different. But in the packed vq, the device knows the next descriptor
is placed at the next entry in the descriptor ring, so the only
important id is the last one.
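A toy model (illustrative names, not the SVQ code) makes the point concrete: the driver may write the same id on every descriptor of a packed chain, because a device walking the chain only consumes the id of the tail descriptor.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_F_NEXT 0x1  /* stand-in for VRING_DESC_F_NEXT */

struct toy_desc { uint16_t id; uint16_t flags; };

/* Toy driver: write a chain of n descriptors starting at 'head',
 * reusing the same id on every entry (as the SVQ patch above does). */
static void toy_write_chain(struct toy_desc *descs, unsigned head,
                            unsigned n, uint16_t id)
{
    for (unsigned k = 0; k < n; k++) {
        descs[head + k].id = id;
        descs[head + k].flags = (k + 1 == n) ? 0 : TOY_F_NEXT;
    }
}

/* Toy device: walk a chain and return the id identifying it, i.e. the
 * id of the tail descriptor; ids on earlier entries are never read. */
static uint16_t toy_chain_id(const struct toy_desc *descs, unsigned head)
{
    unsigned i = head;
    while (descs[i].flags & TOY_F_NEXT) {
        i++;
    }
    return descs[i].id;
}
```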

> > +if (++i >= svq->vring_packed.vring.num) {
> > +i = 0;
> > +svq->vring_packed.avail_used_flags ^=
> > +1 << VRING_PACKED_DESC_F_AVAIL |
> > +1 << VRING_PACKED_DESC_F_USED;
> > +}
> > +}
> > +
> > +if (i <= head) {
> > +svq->vring_packed.avail_wrap_counter ^= 1;
> > +}
> > +
> > +svq->vring_packed.next_avail_idx = i;
> > +svq->free_head = curr;
>
> Even though the same id is used, curr will not be id+1 here.
>

curr is not the descriptor index, but the id. The ids are used like a
stack: making a chain available pops an id, and a used chain pushes its
id back onto the stack.
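The stack behaviour can be sketched on its own (hypothetical names modeled on svq->free_head and svq->desc_next): desc_next links the free ids, pop on making a chain available, push when the device returns one.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_NUM 4

/* Toy model of the id free list (stand-ins for svq->free_head and
 * svq->desc_next, not the actual SVQ structures). */
struct toy_id_stack { uint16_t free_head; uint16_t desc_next[TOY_NUM]; };

static void toy_ids_init(struct toy_id_stack *s)
{
    s->free_head = 0;
    for (uint16_t i = 0; i < TOY_NUM; i++) {
        s->desc_next[i] = i + 1;   /* initially 0 -> 1 -> 2 -> 3 */
    }
}

/* Making a chain available pops one id off the stack... */
static uint16_t toy_id_pop(struct toy_id_stack *s)
{
    uint16_t id = s->free_head;
    s->free_head = s->desc_next[id];
    return id;
}

/* ...and a used chain pushes its id back on top. */
static void toy_id_push(struct toy_id_stack *s, uint16_t id)
{
    s->desc_next[id] = s->free_head;
    s->free_head = id;
}
```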

Maybe I'm wrong, but I think the main reason is to reuse the same
memory

[PATCH] hmp-commands-info.hx: Add missing info command for stats subcommand

2024-06-15 Thread Martin Joerg
Signed-off-by: Martin Joerg 
---
 hmp-commands-info.hx | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hmp-commands-info.hx b/hmp-commands-info.hx
index 20a9835ea8..f5639af517 100644
--- a/hmp-commands-info.hx
+++ b/hmp-commands-info.hx
@@ -892,7 +892,7 @@ ERST
 },
 
 SRST
-  ``stats``
+  ``info stats``
 Show runtime-collected statistics
 ERST
 



Re: [PATCH v2 3/6] virtio: virtqueue_ordered_fill - VIRTIO_F_IN_ORDER support

2024-05-23 Thread Eugenio Perez Martin
On Thu, May 23, 2024 at 12:30 PM Jonah Palmer  wrote:
>
>
>
> On 5/22/24 12:07 PM, Eugenio Perez Martin wrote:
> > On Mon, May 20, 2024 at 3:01 PM Jonah Palmer  
> > wrote:
> >>
> >> Add VIRTIO_F_IN_ORDER feature support for the virtqueue_fill operation.
> >>
> >> The goal of the virtqueue_ordered_fill operation when the
> >> VIRTIO_F_IN_ORDER feature has been negotiated is to search for this
> >> now-used element, set its length, and mark the element as filled in
> >> the VirtQueue's used_elems array.
> >>
> >> By marking the element as filled, it will indicate that this element has
> >> been processed and is ready to be flushed, so long as the element is
> >> in-order.
> >>
> >> Signed-off-by: Jonah Palmer 
> >> ---
> >>   hw/virtio/virtio.c | 36 +++-
> >>   1 file changed, 35 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
> >> index 7456d61bc8..01b6b32460 100644
> >> --- a/hw/virtio/virtio.c
> >> +++ b/hw/virtio/virtio.c
> >> @@ -873,6 +873,38 @@ static void virtqueue_packed_fill(VirtQueue *vq, 
> >> const VirtQueueElement *elem,
> >>   vq->used_elems[idx].ndescs = elem->ndescs;
> >>   }
> >>
> >> +static void virtqueue_ordered_fill(VirtQueue *vq, const VirtQueueElement 
> >> *elem,
> >> +   unsigned int len)
> >> +{
> >> +unsigned int i, steps, max_steps;
> >> +
> >> +i = vq->used_idx;
> >> +steps = 0;
> >> +/*
> >> + * We shouldn't need to increase 'i' by more than the distance
> >> + * between used_idx and last_avail_idx.
> >> + */
> >> +max_steps = (vq->last_avail_idx + vq->vring.num - vq->used_idx)
> >> +% vq->vring.num;
> >
> > I may be missing something, but (+vq->vring.num) is redundant if we (%
> > vq->vring.num), isn't it?
> >
>
> It ensures the result is always non-negative (e.g. when
> vq->last_avail_idx < vq->used_idx).
>
> I wasn't sure how different platforms or compilers would handle
> something like -5 % 10, so to be safe I included the '+ vq->vring.num'.
>
> For example, on my system, in test.c;
>
> #include <stdio.h>
>
> int main() {
> unsigned int result = -5 % 10;
> printf("Result of -5 %% 10 is: %d\n", result);
> return 0;
> }
>
> # gcc -o test test.c
>
> # ./test
> Result of -5 % 10 is: -5
>

I think the modulo is being done on signed ints in your test, and the
signed result is then converted to an unsigned int. Like result = (-5 % 10).

The unsigned wrap is always defined in C, and vq->last_avail_idx and
vq->used_idx are both unsigned. Here is a closer test:
#include <stdio.h>

int main(void) {
unsigned int a = -5, b = 2;
unsigned int result = (b-a) % 10;
printf("Result of -5 %% 10 is: %u\n", result);
return 0;
}

But it is a good catch for signed ints for sure :).
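On QEMU's actual index type the same property holds: last_avail_idx and used_idx are uint16_t free-running indices, so their unsigned difference wraps to a small non-negative count even across index wrap-around, with no '+ num' needed (sketch below, helper name is illustrative).

```c
#include <assert.h>
#include <stdint.h>

/* Distance from used_idx to last_avail_idx with 16-bit free-running
 * indices; unsigned subtraction wraps modulo 2^16, so the result is
 * always the non-negative number of steps between them. */
static uint16_t toy_idx_distance(uint16_t last_avail, uint16_t used)
{
    return (uint16_t)(last_avail - used);
}
```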

Thanks!

> >> +
> >> +/* Search for element in vq->used_elems */
> >> +while (steps <= max_steps) {
> >> +/* Found element, set length and mark as filled */
> >> +if (vq->used_elems[i].index == elem->index) {
> >> +vq->used_elems[i].len = len;
> >> +vq->used_elems[i].in_order_filled = true;
> >> +break;
> >> +}
> >> +
> >> +i += vq->used_elems[i].ndescs;
> >> +steps += vq->used_elems[i].ndescs;
> >> +
> >> +if (i >= vq->vring.num) {
> >> +i -= vq->vring.num;
> >> +}
> >> +}
> >> +}
> >> +
> >
> > Let's report an error if we finish the loop. I think:
> > qemu_log_mask(LOG_GUEST_ERROR,
> >"%s: %s cannot fill buffer id %u\n",
> >__func__, vdev->name, elem->index);
> >
> > (or similar) should do.
> >
> > apart from that,
> >
> > Reviewed-by: Eugenio Pérez 
> >
>
> Gotcha. Will add this in v3.
>
> Thank you Eugenio!
>
> >>   static void virtqueue_packed_fill_desc(VirtQueue *vq,
> >>  const VirtQueueElement *elem,
> >>  unsigned int idx,
> >> @@ -923,7 +955,9 @@ void virtqueue_fill(VirtQueue *vq, const 
> >> VirtQueueElement *elem,
> >>   return;
> >>   }
> >>
> >> -if (virtio_vdev_has_feature(vq->vdev, VIRTIO_F_RING_PACKED)) {
> >> +if (virtio_vdev_has_feature(vq->vdev, VIRTIO_F_IN_ORDER)) {
> >> +virtqueue_ordered_fill(vq, elem, len);
> >> +} else if (virtio_vdev_has_feature(vq->vdev, VIRTIO_F_RING_PACKED)) {
> >>   virtqueue_packed_fill(vq, elem, len, idx);
> >>   } else {
> >>   virtqueue_split_fill(vq, elem, len, idx);
> >> --
> >> 2.39.3
> >>
> >
>




Re: [PATCH v2 3/6] virtio: virtqueue_ordered_fill - VIRTIO_F_IN_ORDER support

2024-05-22 Thread Eugenio Perez Martin
On Mon, May 20, 2024 at 3:01 PM Jonah Palmer  wrote:
>
> Add VIRTIO_F_IN_ORDER feature support for the virtqueue_fill operation.
>
> The goal of the virtqueue_ordered_fill operation when the
> VIRTIO_F_IN_ORDER feature has been negotiated is to search for this
> now-used element, set its length, and mark the element as filled in
> the VirtQueue's used_elems array.
>
> By marking the element as filled, it will indicate that this element has
> been processed and is ready to be flushed, so long as the element is
> in-order.
>
> Signed-off-by: Jonah Palmer 
> ---
>  hw/virtio/virtio.c | 36 +++-
>  1 file changed, 35 insertions(+), 1 deletion(-)
>
> diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
> index 7456d61bc8..01b6b32460 100644
> --- a/hw/virtio/virtio.c
> +++ b/hw/virtio/virtio.c
> @@ -873,6 +873,38 @@ static void virtqueue_packed_fill(VirtQueue *vq, const 
> VirtQueueElement *elem,
>  vq->used_elems[idx].ndescs = elem->ndescs;
>  }
>
> +static void virtqueue_ordered_fill(VirtQueue *vq, const VirtQueueElement 
> *elem,
> +   unsigned int len)
> +{
> +unsigned int i, steps, max_steps;
> +
> +i = vq->used_idx;
> +steps = 0;
> +/*
> + * We shouldn't need to increase 'i' by more than the distance
> + * between used_idx and last_avail_idx.
> + */
> +max_steps = (vq->last_avail_idx + vq->vring.num - vq->used_idx)
> +% vq->vring.num;

I may be missing something, but (+vq->vring.num) is redundant if we (%
vq->vring.num), isn't it?

> +
> +/* Search for element in vq->used_elems */
> +while (steps <= max_steps) {
> +/* Found element, set length and mark as filled */
> +if (vq->used_elems[i].index == elem->index) {
> +vq->used_elems[i].len = len;
> +vq->used_elems[i].in_order_filled = true;
> +break;
> +}
> +
> +i += vq->used_elems[i].ndescs;
> +steps += vq->used_elems[i].ndescs;
> +
> +if (i >= vq->vring.num) {
> +i -= vq->vring.num;
> +}
> +}
> +}
> +

Let's report an error if we finish the loop. I think:
qemu_log_mask(LOG_GUEST_ERROR,
  "%s: %s cannot fill buffer id %u\n",
  __func__, vdev->name, elem->index);

(or similar) should do.

apart from that,

Reviewed-by: Eugenio Pérez 

>  static void virtqueue_packed_fill_desc(VirtQueue *vq,
> const VirtQueueElement *elem,
> unsigned int idx,
> @@ -923,7 +955,9 @@ void virtqueue_fill(VirtQueue *vq, const VirtQueueElement 
> *elem,
>  return;
>  }
>
> -if (virtio_vdev_has_feature(vq->vdev, VIRTIO_F_RING_PACKED)) {
> +if (virtio_vdev_has_feature(vq->vdev, VIRTIO_F_IN_ORDER)) {
> +virtqueue_ordered_fill(vq, elem, len);
> +} else if (virtio_vdev_has_feature(vq->vdev, VIRTIO_F_RING_PACKED)) {
>  virtqueue_packed_fill(vq, elem, len, idx);
>  } else {
>  virtqueue_split_fill(vq, elem, len, idx);
> --
> 2.39.3
>




Re: [PATCH v2 2/6] virtio: virtqueue_pop - VIRTIO_F_IN_ORDER support

2024-05-22 Thread Eugenio Perez Martin
On Mon, May 20, 2024 at 3:01 PM Jonah Palmer  wrote:
>
> Add VIRTIO_F_IN_ORDER feature support in virtqueue_split_pop and
> virtqueue_packed_pop.
>
> VirtQueueElements popped from the available/descriptor ring are added to
> the VirtQueue's used_elems array in-order and in the same fashion as
> they would be added to the used and descriptor rings, respectively.
>
> This will allow us to keep track of the current order, what elements
> have been written, as well as an element's essential data after being
> processed.
>
> Tested-by: Lei Yang 
> Signed-off-by: Jonah Palmer 
> ---
>  hw/virtio/virtio.c | 17 -
>  1 file changed, 16 insertions(+), 1 deletion(-)
>
> diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
> index 893a072c9d..7456d61bc8 100644
> --- a/hw/virtio/virtio.c
> +++ b/hw/virtio/virtio.c
> @@ -1506,7 +1506,7 @@ static void *virtqueue_alloc_element(size_t sz, 
> unsigned out_num, unsigned in_nu
>
>  static void *virtqueue_split_pop(VirtQueue *vq, size_t sz)
>  {
> -unsigned int i, head, max;
> +unsigned int i, head, max, prev_avail_idx;
>  VRingMemoryRegionCaches *caches;
>  MemoryRegionCache indirect_desc_cache;
>  MemoryRegionCache *desc_cache;
> @@ -1539,6 +1539,8 @@ static void *virtqueue_split_pop(VirtQueue *vq, size_t 
> sz)
>  goto done;
>  }
>
> +prev_avail_idx = vq->last_avail_idx;
> +
>  if (!virtqueue_get_head(vq, vq->last_avail_idx++, &head)) {
>  goto done;
>  }
> @@ -1630,6 +1632,12 @@ static void *virtqueue_split_pop(VirtQueue *vq, size_t 
> sz)
>  elem->in_sg[i] = iov[out_num + i];
>  }
>
> +if (virtio_vdev_has_feature(vdev, VIRTIO_F_IN_ORDER)) {

I think vq->last_avail_idx - 1 could be more clear here.

Either way,

Reviewed-by: Eugenio Pérez 

> +vq->used_elems[prev_avail_idx].index = elem->index;
> +vq->used_elems[prev_avail_idx].len = elem->len;
> +vq->used_elems[prev_avail_idx].ndescs = elem->ndescs;
> +}
> +
>  vq->inuse++;
>
>  trace_virtqueue_pop(vq, elem, elem->in_num, elem->out_num);
> @@ -1758,6 +1766,13 @@ static void *virtqueue_packed_pop(VirtQueue *vq, 
> size_t sz)
>
>  elem->index = id;
>  elem->ndescs = (desc_cache == &indirect_desc_cache) ? 1 : elem_entries;
> +
> +if (virtio_vdev_has_feature(vdev, VIRTIO_F_IN_ORDER)) {
> +vq->used_elems[vq->last_avail_idx].index = elem->index;
> +vq->used_elems[vq->last_avail_idx].len = elem->len;
> +vq->used_elems[vq->last_avail_idx].ndescs = elem->ndescs;
> +}
> +
>  vq->last_avail_idx += elem->ndescs;
>  vq->inuse += elem->ndescs;
>
> --
> 2.39.3
>




Re: [PATCH v2 1/6] virtio: Add bool to VirtQueueElement

2024-05-22 Thread Eugenio Perez Martin
On Mon, May 20, 2024 at 3:01 PM Jonah Palmer  wrote:
>
> Add the boolean 'in_order_filled' member to the VirtQueueElement structure.
> The use of this boolean will signify whether the element has been processed
> and is ready to be flushed (so long as the element is in-order). This
> boolean is used to support the VIRTIO_F_IN_ORDER feature.
>
> Tested-by: Lei Yang 

The code has changed from the version that Lei tested, so we should
drop this tag until he re-tests it.

Reviewed-by: Eugenio Pérez 

> Signed-off-by: Jonah Palmer 
> ---
>  include/hw/virtio/virtio.h | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/include/hw/virtio/virtio.h b/include/hw/virtio/virtio.h
> index 7d5ffdc145..88e70c1ae1 100644
> --- a/include/hw/virtio/virtio.h
> +++ b/include/hw/virtio/virtio.h
> @@ -69,6 +69,8 @@ typedef struct VirtQueueElement
>  unsigned int ndescs;
>  unsigned int out_num;
>  unsigned int in_num;
> +/* Element has been processed (VIRTIO_F_IN_ORDER) */
> +bool in_order_filled;
>  hwaddr *in_addr;
>  hwaddr *out_addr;
>  struct iovec *in_sg;
> --
> 2.39.3
>




Re: Intention to work on GSoC project

2024-05-13 Thread Eugenio Perez Martin
On Mon, May 13, 2024 at 3:49 PM Sahil  wrote:
>
> Hi,
>
> On Wednesday, May 8, 2024 8:53:12 AM GMT+5:30 Sahil wrote:
> > Hi,
> >
> > On Tuesday, May 7, 2024 12:44:33 PM IST Eugenio Perez Martin wrote:
> > > [...]
> > >
> > > > Shall I start by implementing a mechanism to check if the feature bit
> > > > "VIRTIO_F_RING_PACKED" is set (using "virtio_vdev_has_feature")? And
> > > > if it's supported, "vhost_svq_add" should call "vhost_svq_add_packed".
> > > > Following this, I can then start implementing "vhost_svq_add_packed"
> > > > and progress from there.
> > > >
> > > > What are your thoughts on this?
> > >
> > > Yes, that's totally right.
> > >
> > > I recommend you to also disable _F_EVENT_IDX to start, so the first
> > > version is easier.
> > >
> > > Also, you can send as many incomplete RFCs as you want. For example,
> > > you can send a first version that only implements reading of the guest
> > > avail ring, so we know we're aligned on that. Then, we can send
> > > subsequent RFCs adding features on top.
> >
>
> I have started working on implementing packed virtqueue support in
> vhost-shadow-virtqueue.c. The changes I have made so far are very
> minimal. I have one confusion as well.
>
> In "vhost_svq_add()" [1], a structure of type "VhostShadowVirtqueue"
> is being used. My initial idea was to create a whole new structure (eg:
> VhostShadowVirtqueuePacked). But I realized that "VhostShadowVirtqueue"
> is being used in a lot of other places such as in "struct vhost_vdpa" [2]
> (in "vhost-vdpa.h"). So maybe this isn't a good idea.
>
> The problem is that "VhostShadowVirtqueue" has a member of type "struct
> vring" [3] which represents a split virtqueue [4]. My idea now is to instead
> wrap this member in a union so that the struct would look something like
> this.
>
> struct VhostShadowVirtqueue {
>     union {
>         struct vring vring;
>         struct packed_vring packed_vring;
>     };
>     ...
> }
>
> I am not entirely sure if this is a good idea. It is similar to what's been done
> in Linux's "drivers/virtio/virtio_ring.c" ("struct vring_virtqueue" [5]).
>
> I thought I would ask this first before continuing further.
>

That's right, this second option makes perfect sense.

VhostShadowVirtqueue should abstract both split and packed. You'll see
that some members are shared, while others are only used by one
format, so those are placed in a union. We should follow the same
pattern, although it is not a problem if we need to diverge a little
from the kernel's code.

Thanks!

> Thanks,
> Sahil
>
> [1] 
> https://gitlab.com/qemu-project/qemu/-/blob/master/hw/virtio/vhost-shadow-virtqueue.c#L249
> [2] 
> https://gitlab.com/qemu-project/qemu/-/blob/master/include/hw/virtio/vhost-vdpa.h#L69
> [3] 
> https://gitlab.com/qemu-project/qemu/-/blob/master/hw/virtio/vhost-shadow-virtqueue.h#L52
> [4] 
> https://gitlab.com/qemu-project/qemu/-/blob/master/include/standard-headers/linux/virtio_ring.h#L156
> [5] 
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/virtio/virtio_ring.c#n199
>
>




Re: [RFC 0/2] Identify aliased maps in vdpa SVQ iova_tree

2024-05-13 Thread Eugenio Perez Martin
On Mon, May 13, 2024 at 10:28 AM Jason Wang  wrote:
>
> On Mon, May 13, 2024 at 2:28 PM Eugenio Perez Martin
>  wrote:
> >
> > On Sat, May 11, 2024 at 6:07 AM Jason Wang  wrote:
> > >
> > > On Fri, May 10, 2024 at 3:16 PM Eugenio Perez Martin
> > >  wrote:
> > > >
> > > > On Fri, May 10, 2024 at 6:29 AM Jason Wang  wrote:
> > > > >
> > > > > On Thu, May 9, 2024 at 3:10 PM Eugenio Perez Martin 
> > > > >  wrote:
> > > > > >
> > > > > > On Thu, May 9, 2024 at 8:27 AM Jason Wang  
> > > > > > wrote:
> > > > > > >
> > > > > > > On Thu, May 9, 2024 at 1:16 AM Eugenio Perez Martin 
> > > > > > >  wrote:
> > > > > > > >
> > > > > > > > On Wed, May 8, 2024 at 4:29 AM Jason Wang  
> > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > On Tue, May 7, 2024 at 6:57 PM Eugenio Perez Martin 
> > > > > > > > >  wrote:
> > > > > > > > > >
> > > > > > > > > > On Tue, May 7, 2024 at 9:29 AM Jason Wang 
> > > > > > > > > >  wrote:
> > > > > > > > > > >
> > > > > > > > > > > On Fri, Apr 12, 2024 at 3:56 PM Eugenio Perez Martin
> > > > > > > > > > >  wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > On Fri, Apr 12, 2024 at 8:47 AM Jason Wang 
> > > > > > > > > > > >  wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Wed, Apr 10, 2024 at 6:03 PM Eugenio Pérez 
> > > > > > > > > > > > >  wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > The guest may have overlapped memory regions, where 
> > > > > > > > > > > > > > different GPA leads
> > > > > > > > > > > > > > to the same HVA.  This causes a problem when 
> > > > > > > > > > > > > > overlapped regions
> > > > > > > > > > > > > > (different GPA but same translated HVA) exists in 
> > > > > > > > > > > > > > the tree, as looking
> > > > > > > > > > > > > > them by HVA will return them twice.
> > > > > > > > > > > > >
> > > > > > > > > > > > > I think I don't understand if there's any side effect 
> > > > > > > > > > > > > for shadow virtqueue?
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > My bad, I totally forgot to put a reference to where 
> > > > > > > > > > > > this comes from.
> > > > > > > > > > > >
> > > > > > > > > > > > Si-Wei found that during initialization this sequences 
> > > > > > > > > > > > of maps /
> > > > > > > > > > > > unmaps happens [1]:
> > > > > > > > > > > >
> > > > > > > > > > > > HVA                GPA                IOVA
> > > > > > > > > > > > -
> > > > > > > > > > > > Map
> > > > > > > > > > > > [0x7f7903e0, 0x7f7983e0)[0x0, 0x8000) 
> > > > > > > > > > > > [0x1000, 0x8000)
> > > > > > > > > > > > [0x7f7983e0, 0x7f9903e0)[0x1, 
> > > > > > > > > > > > 0x208000)
> > > > > > > > > > > > [0x80001000, 0x201000)
> > > > > > > > > > > > [0x7f7903ea, 0x7f7903ec)[0xfeda, 
> > > > > > > > > > > > 0xfedc)
> > > > > > > > > > > > [0x201000, 0x22100

Re: [RFC 0/2] Identify aliased maps in vdpa SVQ iova_tree

2024-05-12 Thread Eugenio Perez Martin
On Sat, May 11, 2024 at 6:07 AM Jason Wang  wrote:
>
> On Fri, May 10, 2024 at 3:16 PM Eugenio Perez Martin
>  wrote:
> >
> > On Fri, May 10, 2024 at 6:29 AM Jason Wang  wrote:
> > >
> > > On Thu, May 9, 2024 at 3:10 PM Eugenio Perez Martin  
> > > wrote:
> > > >
> > > > On Thu, May 9, 2024 at 8:27 AM Jason Wang  wrote:
> > > > >
> > > > > On Thu, May 9, 2024 at 1:16 AM Eugenio Perez Martin 
> > > > >  wrote:
> > > > > >
> > > > > > On Wed, May 8, 2024 at 4:29 AM Jason Wang  
> > > > > > wrote:
> > > > > > >
> > > > > > > On Tue, May 7, 2024 at 6:57 PM Eugenio Perez Martin 
> > > > > > >  wrote:
> > > > > > > >
> > > > > > > > On Tue, May 7, 2024 at 9:29 AM Jason Wang  
> > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > On Fri, Apr 12, 2024 at 3:56 PM Eugenio Perez Martin
> > > > > > > > >  wrote:
> > > > > > > > > >
> > > > > > > > > > On Fri, Apr 12, 2024 at 8:47 AM Jason Wang 
> > > > > > > > > >  wrote:
> > > > > > > > > > >
> > > > > > > > > > > On Wed, Apr 10, 2024 at 6:03 PM Eugenio Pérez 
> > > > > > > > > > >  wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > The guest may have overlapped memory regions, where 
> > > > > > > > > > > > different GPA leads
> > > > > > > > > > > > to the same HVA.  This causes a problem when overlapped 
> > > > > > > > > > > > regions
> > > > > > > > > > > > (different GPA but same translated HVA) exists in the 
> > > > > > > > > > > > tree, as looking
> > > > > > > > > > > > them by HVA will return them twice.
> > > > > > > > > > >
> > > > > > > > > > > I think I don't understand if there's any side effect for 
> > > > > > > > > > > shadow virtqueue?
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > My bad, I totally forgot to put a reference to where this 
> > > > > > > > > > comes from.
> > > > > > > > > >
> > > > > > > > > > Si-Wei found that during initialization this sequences of 
> > > > > > > > > > maps /
> > > > > > > > > > unmaps happens [1]:
> > > > > > > > > >
> > > > > > > > > > HVA                GPA                IOVA
> > > > > > > > > > -
> > > > > > > > > > Map
> > > > > > > > > > [0x7f7903e0, 0x7f7983e0)[0x0, 0x8000) 
> > > > > > > > > > [0x1000, 0x8000)
> > > > > > > > > > [0x7f7983e0, 0x7f9903e0)[0x1, 
> > > > > > > > > > 0x208000)
> > > > > > > > > > [0x80001000, 0x201000)
> > > > > > > > > > [0x7f7903ea, 0x7f7903ec)[0xfeda, 0xfedc)
> > > > > > > > > > [0x201000, 0x221000)
> > > > > > > > > >
> > > > > > > > > > Unmap
> > > > > > > > > > [0x7f7903ea, 0x7f7903ec)[0xfeda, 
> > > > > > > > > > 0xfedc) [0x1000,
> > > > > > > > > > 0x2) ???
> > > > > > > > > >
> > > > > > > > > > The third HVA range is contained in the first one, but 
> > > > > > > > > > exposed under a
> > > > > > > > > > different GVA (aliased). This is not "flattened" by QEMU, 
> > > > > > > > > > as GPA does
> > > > > > > > > > not overlap, only HVA.
> > > > > > >

Re: [PATCH 4/6] virtio: virtqueue_ordered_flush - VIRTIO_F_IN_ORDER support

2024-05-10 Thread Eugenio Perez Martin
On Mon, May 6, 2024 at 5:06 PM Jonah Palmer  wrote:
>
> Add VIRTIO_F_IN_ORDER feature support for virtqueue_flush operations.
>
> The goal of the virtqueue_flush operation when the VIRTIO_F_IN_ORDER
> feature has been negotiated is to write elements to the used/descriptor
> ring in-order and then update used_idx.
>
> The function iterates through the VirtQueueElement used_elems array
> in-order starting at vq->used_idx. If the element is valid (filled), the
> element is written to the used/descriptor ring. This process continues
> until we find an invalid (not filled) element.
>
> If any elements were written, the used_idx is updated.
>
> Tested-by: Lei Yang 
> Signed-off-by: Jonah Palmer 
> ---
>  hw/virtio/virtio.c | 75 +-
>  1 file changed, 74 insertions(+), 1 deletion(-)
>
> diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
> index 064046b5e2..0efed2c88e 100644
> --- a/hw/virtio/virtio.c
> +++ b/hw/virtio/virtio.c
> @@ -1006,6 +1006,77 @@ static void virtqueue_packed_flush(VirtQueue *vq, 
> unsigned int count)
>  }
>  }
>
> +static void virtqueue_ordered_flush(VirtQueue *vq)
> +{
> +unsigned int i = vq->used_idx;
> +unsigned int ndescs = 0;
> +uint16_t old = vq->used_idx;
> +bool packed;
> +VRingUsedElem uelem;
> +
> +packed = virtio_vdev_has_feature(vq->vdev, VIRTIO_F_RING_PACKED);
> +
> +if (packed) {
> +if (unlikely(!vq->vring.desc)) {
> +return;
> +}
> +} else if (unlikely(!vq->vring.used)) {
> +return;
> +}
> +
> +/* First expected in-order element isn't ready, nothing to do */
> +if (!vq->used_elems[i].filled) {
> +return;
> +}
> +
> +/* Write first expected in-order element to used ring (split VQs) */
> +if (!packed) {
> +uelem.id = vq->used_elems[i].index;
> +uelem.len = vq->used_elems[i].len;
> +vring_used_write(vq, &uelem, i);
> +}
> +
> +ndescs += vq->used_elems[i].ndescs;
> +i += ndescs;
> +if (i >= vq->vring.num) {
> +i -= vq->vring.num;
> +}
> +
> +/* Search for more filled elements in-order */
> +while (vq->used_elems[i].filled) {
> +if (packed) {
> +virtqueue_packed_fill_desc(vq, &vq->used_elems[i], ndescs, 
> false);
> +} else {
> +uelem.id = vq->used_elems[i].index;
> +uelem.len = vq->used_elems[i].len;
> +vring_used_write(vq, &uelem, i);
> +}
> +
> +vq->used_elems[i].filled = false;
> +ndescs += vq->used_elems[i].ndescs;
> +i += ndescs;
> +if (i >= vq->vring.num) {
> +i -= vq->vring.num;
> +}
> +}
> +

I may be missing something, but you have split the first case out as a
special one, entirely outside the while loop. Can't it be handled
inside the loop by checking !(packed && i == vq->used_idx)? That would
avoid the code duplication.

A comment can be added along the lines of "the first entry of a packed
VQ is written last so the guest does not see invalid descriptors".

> +if (packed) {
> +virtqueue_packed_fill_desc(vq, &vq->used_elems[vq->used_idx], 0, 
> true);
> +vq->used_idx += ndescs;
> +if (vq->used_idx >= vq->vring.num) {
> +vq->used_idx -= vq->vring.num;
> +vq->used_wrap_counter ^= 1;
> +vq->signalled_used_valid = false;
> +}
> +} else {
> +vring_used_idx_set(vq, i);
> +if (unlikely((int16_t)(i - vq->signalled_used) < (uint16_t)(i - 
> old))) {
> +vq->signalled_used_valid = false;
> +}
> +}
> +vq->inuse -= ndescs;
> +}
> +
>  void virtqueue_flush(VirtQueue *vq, unsigned int count)
>  {
>  if (virtio_device_disabled(vq->vdev)) {
> @@ -1013,7 +1084,9 @@ void virtqueue_flush(VirtQueue *vq, unsigned int count)
>  return;
>  }
>
> -if (virtio_vdev_has_feature(vq->vdev, VIRTIO_F_RING_PACKED)) {
> +if (virtio_vdev_has_feature(vq->vdev, VIRTIO_F_IN_ORDER)) {
> +virtqueue_ordered_flush(vq);
> +} else if (virtio_vdev_has_feature(vq->vdev, VIRTIO_F_RING_PACKED)) {
>  virtqueue_packed_flush(vq, count);
>  } else {
>  virtqueue_split_flush(vq, count);
> --
> 2.39.3
>




Re: [RFC 0/2] Identify aliased maps in vdpa SVQ iova_tree

2024-05-10 Thread Eugenio Perez Martin
On Fri, May 10, 2024 at 6:29 AM Jason Wang  wrote:
>
> On Thu, May 9, 2024 at 3:10 PM Eugenio Perez Martin  
> wrote:
> >
> > On Thu, May 9, 2024 at 8:27 AM Jason Wang  wrote:
> > >
> > > On Thu, May 9, 2024 at 1:16 AM Eugenio Perez Martin  
> > > wrote:
> > > >
> > > > On Wed, May 8, 2024 at 4:29 AM Jason Wang  wrote:
> > > > >
> > > > > On Tue, May 7, 2024 at 6:57 PM Eugenio Perez Martin 
> > > > >  wrote:
> > > > > >
> > > > > > On Tue, May 7, 2024 at 9:29 AM Jason Wang  
> > > > > > wrote:
> > > > > > >
> > > > > > > On Fri, Apr 12, 2024 at 3:56 PM Eugenio Perez Martin
> > > > > > >  wrote:
> > > > > > > >
> > > > > > > > On Fri, Apr 12, 2024 at 8:47 AM Jason Wang 
> > > > > > > >  wrote:
> > > > > > > > >
> > > > > > > > > On Wed, Apr 10, 2024 at 6:03 PM Eugenio Pérez 
> > > > > > > > >  wrote:
> > > > > > > > > >
> > > > > > > > > > The guest may have overlapped memory regions, where 
> > > > > > > > > > different GPA leads
> > > > > > > > > > to the same HVA.  This causes a problem when overlapped 
> > > > > > > > > > regions
> > > > > > > > > > (different GPA but same translated HVA) exists in the tree, 
> > > > > > > > > > as looking
> > > > > > > > > > them by HVA will return them twice.
> > > > > > > > >
> > > > > > > > > I think I don't understand if there's any side effect for 
> > > > > > > > > shadow virtqueue?
> > > > > > > > >
> > > > > > > >
> > > > > > > > My bad, I totally forgot to put a reference to where this comes 
> > > > > > > > from.
> > > > > > > >
> > > > > > > > Si-Wei found that during initialization this sequences of maps /
> > > > > > > > unmaps happens [1]:
> > > > > > > >
> > > > > > > > HVA                GPA                IOVA
> > > > > > > > -
> > > > > > > > Map
> > > > > > > > [0x7f7903e0, 0x7f7983e0)[0x0, 0x8000) [0x1000, 
> > > > > > > > 0x8000)
> > > > > > > > [0x7f7983e0, 0x7f9903e0)[0x1, 0x208000)
> > > > > > > > [0x80001000, 0x201000)
> > > > > > > > [0x7f7903ea, 0x7f7903ec)[0xfeda, 0xfedc)
> > > > > > > > [0x201000, 0x221000)
> > > > > > > >
> > > > > > > > Unmap
> > > > > > > > [0x7f7903ea, 0x7f7903ec)[0xfeda, 0xfedc) 
> > > > > > > > [0x1000,
> > > > > > > > 0x2) ???
> > > > > > > >
> > > > > > > > The third HVA range is contained in the first one, but exposed 
> > > > > > > > under a
> > > > > > > > different GVA (aliased). This is not "flattened" by QEMU, as 
> > > > > > > > GPA does
> > > > > > > > not overlap, only HVA.
> > > > > > > >
> > > > > > > > At the third chunk unmap, the current algorithm finds the first 
> > > > > > > > chunk,
> > > > > > > > not the second one. This series is the way to tell the 
> > > > > > > > difference at
> > > > > > > > unmap time.
> > > > > > > >
> > > > > > > > [1] 
> > > > > > > > https://lists.nongnu.org/archive/html/qemu-devel/2024-04/msg00079.html
> > > > > > > >
> > > > > > > > Thanks!
> > > > > > >
> > > > > > > Ok, I was wondering if we need to store GPA(GIOVA) to HVA 
> > > > > > > mappings in
> > > > > > > the iova tree to solve this issue completely. Then there won

Re: [PATCH 3/6] virtio: virtqueue_ordered_fill - VIRTIO_F_IN_ORDER support

2024-05-09 Thread Eugenio Perez Martin
On Mon, May 6, 2024 at 5:05 PM Jonah Palmer  wrote:
>
> Add VIRTIO_F_IN_ORDER feature support for virtqueue_fill operations.
>
> The goal of the virtqueue_fill operation when the VIRTIO_F_IN_ORDER
> feature has been negotiated is to search for this now-used element,
> set its length, and mark the element as filled in the VirtQueue's
> used_elems array.
>
> By marking the element as filled, it will indicate that this element is
> ready to be flushed, so long as the element is in-order.
>
> Tested-by: Lei Yang 
> Signed-off-by: Jonah Palmer 
> ---
>  hw/virtio/virtio.c | 26 +-
>  1 file changed, 25 insertions(+), 1 deletion(-)
>
> diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
> index e6eb1bb453..064046b5e2 100644
> --- a/hw/virtio/virtio.c
> +++ b/hw/virtio/virtio.c
> @@ -873,6 +873,28 @@ static void virtqueue_packed_fill(VirtQueue *vq, const 
> VirtQueueElement *elem,
>  vq->used_elems[idx].ndescs = elem->ndescs;
>  }
>
> +static void virtqueue_ordered_fill(VirtQueue *vq, const VirtQueueElement 
> *elem,
> +   unsigned int len)
> +{
> +unsigned int i = vq->used_idx;
> +
> +/* Search for element in vq->used_elems */
> +while (i != vq->last_avail_idx) {
> +/* Found element, set length and mark as filled */
> +if (vq->used_elems[i].index == elem->index) {
> +vq->used_elems[i].len = len;
> +vq->used_elems[i].filled = true;
> +break;
> +}
> +
> +i += vq->used_elems[i].ndescs;
> +
> +if (i >= vq->vring.num) {
> +i -= vq->vring.num;
> +}
> +}

This has a subtle problem: ndescs and elem->id are controlled by the
guest, so they could make QEMU loop forever looking for the right
descriptor. For each iteration, the code must ensure that the variable
"i" changes before the next iteration, and that there will be no more
than vq->last_avail_idx - vq->used_idx iterations.

Apart from that, I think it makes more sense to split the logical
sections of the function this way:

/* declarations */
i = vq->used_idx;

/* Search for element in vq->used_elems */
while (vq->used_elems[i].index != elem->index &&
       i != vq->last_avail_idx && ...) {
    ...
}

/* Set length and mark as filled */
vq->used_elems[i].len = len;
vq->used_elems[i].filled = true;
---

But I'm ok either way.

> +}
> +
>  static void virtqueue_packed_fill_desc(VirtQueue *vq,
> const VirtQueueElement *elem,
> unsigned int idx,
> @@ -923,7 +945,9 @@ void virtqueue_fill(VirtQueue *vq, const VirtQueueElement 
> *elem,
>  return;
>  }
>
> -if (virtio_vdev_has_feature(vq->vdev, VIRTIO_F_RING_PACKED)) {
> +if (virtio_vdev_has_feature(vq->vdev, VIRTIO_F_IN_ORDER)) {
> +virtqueue_ordered_fill(vq, elem, len);
> +} else if (virtio_vdev_has_feature(vq->vdev, VIRTIO_F_RING_PACKED)) {
>  virtqueue_packed_fill(vq, elem, len, idx);
>  } else {
>  virtqueue_split_fill(vq, elem, len, idx);
> --
> 2.39.3
>




Re: [PATCH 2/6] virtio: virtqueue_pop - VIRTIO_F_IN_ORDER support

2024-05-09 Thread Eugenio Perez Martin
On Mon, May 6, 2024 at 5:06 PM Jonah Palmer  wrote:
>
> Add VIRTIO_F_IN_ORDER feature support in virtqueue_split_pop and
> virtqueue_packed_pop.
>
> VirtQueueElements popped from the available/descriptor ring are added to
> the VirtQueue's used_elems array in-order and in the same fashion as
> they would be added to the used and descriptor rings, respectively.
>
> This will allow us to keep track of the current order, what elements
> have been written, as well as an element's essential data after being
> processed.
>
> Tested-by: Lei Yang 
> Signed-off-by: Jonah Palmer 
> ---
>  hw/virtio/virtio.c | 17 -
>  1 file changed, 16 insertions(+), 1 deletion(-)
>
> diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
> index 893a072c9d..e6eb1bb453 100644
> --- a/hw/virtio/virtio.c
> +++ b/hw/virtio/virtio.c
> @@ -1506,7 +1506,7 @@ static void *virtqueue_alloc_element(size_t sz, 
> unsigned out_num, unsigned in_nu
>
>  static void *virtqueue_split_pop(VirtQueue *vq, size_t sz)
>  {
> -unsigned int i, head, max;
> +unsigned int i, j, head, max;
>  VRingMemoryRegionCaches *caches;
>  MemoryRegionCache indirect_desc_cache;
>  MemoryRegionCache *desc_cache;
> @@ -1539,6 +1539,8 @@ static void *virtqueue_split_pop(VirtQueue *vq, size_t 
> sz)
>  goto done;
>  }
>
> +j = vq->last_avail_idx;
> +
>  if (!virtqueue_get_head(vq, vq->last_avail_idx++, &head)) {
>  goto done;
>  }
> @@ -1630,6 +1632,12 @@ static void *virtqueue_split_pop(VirtQueue *vq, size_t 
> sz)
>  elem->in_sg[i] = iov[out_num + i];
>  }
>
> +if (virtio_vdev_has_feature(vdev, VIRTIO_F_IN_ORDER)) {
> +vq->used_elems[j].index = elem->index;
> +vq->used_elems[j].len = elem->len;
> +vq->used_elems[j].ndescs = elem->ndescs;
> +}
> +
>  vq->inuse++;
>
>  trace_virtqueue_pop(vq, elem, elem->in_num, elem->out_num);
> @@ -1758,6 +1766,13 @@ static void *virtqueue_packed_pop(VirtQueue *vq, 
> size_t sz)
>
>  elem->index = id;
>  elem->ndescs = (desc_cache == &indirect_desc_cache) ? 1 : elem_entries;
> +
> +if (virtio_vdev_has_feature(vdev, VIRTIO_F_IN_ORDER)) {
> +vq->used_elems[vq->last_avail_idx].index = elem->index;
> +vq->used_elems[vq->last_avail_idx].len = elem->len;
> +vq->used_elems[vq->last_avail_idx].ndescs = elem->ndescs;
> +}
> +

I suggest using a consistent style between packed and split: either
always use vq->last_avail_idx or always use j. If you keep j, please
rename it to something more related to its usage, as j usually denotes
a loop iterator.

In my opinion, vq->last_avail_idx is better.


>  vq->last_avail_idx += elem->ndescs;
>  vq->inuse += elem->ndescs;
>
> --
> 2.39.3
>




Re: [PATCH 1/6] virtio: Add bool to VirtQueueElement

2024-05-09 Thread Eugenio Perez Martin
On Mon, May 6, 2024 at 5:06 PM Jonah Palmer  wrote:
>
> Add the boolean 'filled' member to the VirtQueueElement structure. The
> use of this boolean will signify if the element has been written to the
> used / descriptor ring or not. This boolean is used to support the
> VIRTIO_F_IN_ORDER feature.
>
> Tested-by: Lei Yang 
> Signed-off-by: Jonah Palmer 
> ---
>  include/hw/virtio/virtio.h | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/include/hw/virtio/virtio.h b/include/hw/virtio/virtio.h
> index 7d5ffdc145..9ed9c3763c 100644
> --- a/include/hw/virtio/virtio.h
> +++ b/include/hw/virtio/virtio.h
> @@ -69,6 +69,7 @@ typedef struct VirtQueueElement
>  unsigned int ndescs;
>  unsigned int out_num;
>  unsigned int in_num;
> +bool filled;

in_order_filled? I cannot come up with a good name for this. Maybe we can
add a comment on top of the variable so we know what it is used for?

>  hwaddr *in_addr;
>  hwaddr *out_addr;
>  struct iovec *in_sg;
> --
> 2.39.3
>




Re: [RFC 0/2] Identify aliased maps in vdpa SVQ iova_tree

2024-05-09 Thread Eugenio Perez Martin
On Thu, May 9, 2024 at 8:27 AM Jason Wang  wrote:
>
> On Thu, May 9, 2024 at 1:16 AM Eugenio Perez Martin  
> wrote:
> >
> > On Wed, May 8, 2024 at 4:29 AM Jason Wang  wrote:
> > >
> > > On Tue, May 7, 2024 at 6:57 PM Eugenio Perez Martin  
> > > wrote:
> > > >
> > > > On Tue, May 7, 2024 at 9:29 AM Jason Wang  wrote:
> > > > >
> > > > > On Fri, Apr 12, 2024 at 3:56 PM Eugenio Perez Martin
> > > > >  wrote:
> > > > > >
> > > > > > On Fri, Apr 12, 2024 at 8:47 AM Jason Wang  
> > > > > > wrote:
> > > > > > >
> > > > > > > On Wed, Apr 10, 2024 at 6:03 PM Eugenio Pérez 
> > > > > > >  wrote:
> > > > > > > >
> > > > > > > > The guest may have overlapped memory regions, where different 
> > > > > > > > GPA leads
> > > > > > > > to the same HVA.  This causes a problem when overlapped regions
> > > > > > > > (different GPA but same translated HVA) exists in the tree, as 
> > > > > > > > looking
> > > > > > > > them by HVA will return them twice.
> > > > > > >
> > > > > > > I think I don't understand if there's any side effect for shadow 
> > > > > > > virtqueue?
> > > > > > >
> > > > > >
> > > > > > My bad, I totally forgot to put a reference to where this comes 
> > > > > > from.
> > > > > >
> > > > > > Si-Wei found that during initialization this sequences of maps /
> > > > > > unmaps happens [1]:
> > > > > >
> > > > > > HVA                GPA                IOVA
> > > > > > -
> > > > > > Map
> > > > > > [0x7f7903e0, 0x7f7983e0)[0x0, 0x8000) [0x1000, 
> > > > > > 0x8000)
> > > > > > [0x7f7983e0, 0x7f9903e0)[0x1, 0x208000)
> > > > > > [0x80001000, 0x201000)
> > > > > > [0x7f7903ea, 0x7f7903ec)[0xfeda, 0xfedc)
> > > > > > [0x201000, 0x221000)
> > > > > >
> > > > > > Unmap
> > > > > > [0x7f7903ea, 0x7f7903ec)[0xfeda, 0xfedc) 
> > > > > > [0x1000,
> > > > > > 0x2) ???
> > > > > >
> > > > > > The third HVA range is contained in the first one, but exposed 
> > > > > > under a
> > > > > > different GVA (aliased). This is not "flattened" by QEMU, as GPA 
> > > > > > does
> > > > > > not overlap, only HVA.
> > > > > >
> > > > > > At the third chunk unmap, the current algorithm finds the first 
> > > > > > chunk,
> > > > > > not the second one. This series is the way to tell the difference at
> > > > > > unmap time.
> > > > > >
> > > > > > [1] 
> > > > > > https://lists.nongnu.org/archive/html/qemu-devel/2024-04/msg00079.html
> > > > > >
> > > > > > Thanks!
> > > > >
> > > > > Ok, I was wondering if we need to store GPA(GIOVA) to HVA mappings in
> > > > > the iova tree to solve this issue completely. Then there won't be
> > > > > aliasing issues.
> > > > >
> > > >
> > > > I'm ok to explore that route but this has another problem. Both SVQ
> > > > vrings and CVQ buffers also need to be addressable by VhostIOVATree,
> > > > and they do not have GPA.
> > > >
> > > > At this moment vhost_svq_translate_addr is able to handle this
> > > > transparently as we translate vaddr to SVQ IOVA. How can we store
> > > > these new entries? Maybe a (hwaddr)-1 GPA to signal it has no GPA and
> > > > then a list to go through other entries (SVQ vaddr and CVQ buffers).
> > >
> > > This seems to be tricky.
> > >
> > > As discussed, it could be another iova tree.
> > >
> >
> > Yes but there are many ways to add another IOVATree. Let me expand & recap.
> >
> > Option 1 is to simply add another iova tree to 

Re: [PATCH] hw/virtio: Fix obtain the buffer id from the last descriptor

2024-05-08 Thread Eugenio Perez Martin
On Thu, May 9, 2024 at 4:20 AM Wafer  wrote:
>
>
>
> On Thu, May, 2024 at 2:21 AM Michael S. Tsirkin  wrote:
> >
> > On Wed, May 08, 2024 at 02:56:11PM +0200, Eugenio Perez Martin wrote:
> > > On Mon, Apr 22, 2024 at 3:41 AM Wafer  wrote:
> > > >
> > > > The virtio-1.3 specification
> > > > <https://docs.oasis-open.org/virtio/virtio/v1.3/virtio-v1.3.html> 
> > > > writes:
> > > > 2.8.6 Next Flag: Descriptor Chaining
> > > >   Buffer ID is included in the last descriptor in the list.
> > > >
> > > > If the feature (_F_INDIRECT_DESC) has been negotiated, install only
> > > > one descriptor in the virtqueue.
> > > > Therefore the buffer id should be obtained from the first descriptor.
> > > >
> > > > In descriptor chaining scenarios, the buffer id should be obtained
> > > > from the last descriptor.
> > > >
> > >
> > > This is actually trickier. While it is true the standard mandates it,
> > > both linux virtio_ring driver and QEMU trusts the ID will be the first
> > > descriptor of the chain. Does merging this change in QEMU without
> > > merging the corresponding one in the linux kernel break things? Or am
> > > I missing something?
> > >
>
> The linux virtio_ring driver set the buffer id into all the descriptors of 
> the chain.
>

Ok now after reading the driver code again I see how I missed that.
Sorry for the noise!

> So bad things can't happen; with this patch, the Linux virtio driver works
> properly.
>
> I have tested it.
>
> > > If it breaks I guess this requires more thinking. I didn't check DPDK,
> > > neither as driver nor as vhost-user device.
> > >
> > > Thanks!
> >
> > I think that if the driver is out of spec we should for starters fix it 
> > ASAP.
>
> The linux driver is within spec.
>
> >
> > > > Fixes: 86044b24e8 ("virtio: basic packed virtqueue support")
> > > >
> > > > Signed-off-by: Wafer 
> > > > ---
> > > >  hw/virtio/virtio.c | 5 +
> > > >  1 file changed, 5 insertions(+)
> > > >
> > > > diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c index
> > > > 871674f9be..f65d4b4161 100644
> > > > --- a/hw/virtio/virtio.c
> > > > +++ b/hw/virtio/virtio.c
> > > > @@ -1739,6 +1739,11 @@ static void *virtqueue_packed_pop(VirtQueue
> > *vq, size_t sz)
> > > >  goto err_undo_map;
> > > >  }
> > > >
> > > > +if (desc_cache != &indirect_desc_cache) {
> > > > +/* Buffer ID is included in the last descriptor in the 
> > > > list. */
> > > > +id = desc.id;
> > > > +}
> > > > +
> > > >  rc = virtqueue_packed_read_next_desc(vq, &desc, desc_cache, 
> > > > max,
> > &i,
> > > >   desc_cache ==
> > > >   &indirect_desc_cache);
> > > > --
> > > > 2.27.0
> > > >
>




Re: [PATCH] hw/virtio: Fix obtain the buffer id from the last descriptor

2024-05-08 Thread Eugenio Perez Martin
On Thu, May 9, 2024 at 6:32 AM Wafer  wrote:
>
>
>
> On Wed, May 08, 2024 at 12:01 PM Jason Wang  wrote:
> >
> > On Mon, Apr 22, 2024 at 9:41 AM Wafer  wrote:
> > >
> > > The virtio-1.3 specification
> > >  writes:
> > > 2.8.6 Next Flag: Descriptor Chaining
> > >   Buffer ID is included in the last descriptor in the list.
> > >
> > > If the feature (_F_INDIRECT_DESC) has been negotiated, install only
> > > one descriptor in the virtqueue.
> > > Therefore the buffer id should be obtained from the first descriptor.
> > >
> > > In descriptor chaining scenarios, the buffer id should be obtained
> > > from the last descriptor.
> > >
> > > Fixes: 86044b24e8 ("virtio: basic packed virtqueue support")
> > >
> > > Signed-off-by: Wafer 
> > > ---
> > >  hw/virtio/virtio.c | 5 +
> > >  1 file changed, 5 insertions(+)
> > >
> > > diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c index
> > > 871674f9be..f65d4b4161 100644
> > > --- a/hw/virtio/virtio.c
> > > +++ b/hw/virtio/virtio.c
> > > @@ -1739,6 +1739,11 @@ static void *virtqueue_packed_pop(VirtQueue
> > *vq, size_t sz)
> > >  goto err_undo_map;
> > >  }
> > >
> > > +if (desc_cache != &indirect_desc_cache) {
> > > +/* Buffer ID is included in the last descriptor in the list. 
> > > */
> > > +id = desc.id;
> > > +}
> >
> > It looks to me we can move this out of the loop.
> >
> > Others look good.
> >
> > Thanks
> >
>
> Thank you for your suggestion, I'll move it out.
>

Please add my

Reviewed-by: Eugenio Pérez 

When you do.

Thanks!


> > > +
> > >  rc = virtqueue_packed_read_next_desc(vq, &desc, desc_cache, max,
> > &i,
> > >   desc_cache ==
> > >   &indirect_desc_cache);
> > > --
> > > 2.27.0
> > >
>




Re: [RFC 0/2] Identify aliased maps in vdpa SVQ iova_tree

2024-05-08 Thread Eugenio Perez Martin
On Wed, May 8, 2024 at 4:29 AM Jason Wang  wrote:
>
> On Tue, May 7, 2024 at 6:57 PM Eugenio Perez Martin  
> wrote:
> >
> > On Tue, May 7, 2024 at 9:29 AM Jason Wang  wrote:
> > >
> > > On Fri, Apr 12, 2024 at 3:56 PM Eugenio Perez Martin
> > >  wrote:
> > > >
> > > > On Fri, Apr 12, 2024 at 8:47 AM Jason Wang  wrote:
> > > > >
> > > > > On Wed, Apr 10, 2024 at 6:03 PM Eugenio Pérez  
> > > > > wrote:
> > > > > >
> > > > > > The guest may have overlapped memory regions, where different GPA 
> > > > > > leads
> > > > > > to the same HVA.  This causes a problem when overlapped regions
> > > > > > (different GPA but same translated HVA) exists in the tree, as 
> > > > > > looking
> > > > > > them by HVA will return them twice.
> > > > >
> > > > > I think I don't understand if there's any side effect for shadow 
> > > > > virtqueue?
> > > > >
> > > >
> > > > My bad, I totally forgot to put a reference to where this comes from.
> > > >
> > > > Si-Wei found that during initialization this sequences of maps /
> > > > unmaps happens [1]:
> > > >
> > > > HVA                GPA                IOVA
> > > > -
> > > > Map
> > > > [0x7f7903e0, 0x7f7983e0)[0x0, 0x8000) [0x1000, 
> > > > 0x8000)
> > > > [0x7f7983e0, 0x7f9903e0)[0x1, 0x208000)
> > > > [0x80001000, 0x201000)
> > > > [0x7f7903ea, 0x7f7903ec)[0xfeda, 0xfedc)
> > > > [0x201000, 0x221000)
> > > >
> > > > Unmap
> > > > [0x7f7903ea, 0x7f7903ec)[0xfeda, 0xfedc) [0x1000,
> > > > 0x2) ???
> > > >
> > > > The third HVA range is contained in the first one, but exposed under a
> > > > different GVA (aliased). This is not "flattened" by QEMU, as GPA does
> > > > not overlap, only HVA.
> > > >
> > > > At the third chunk unmap, the current algorithm finds the first chunk,
> > > > not the second one. This series is the way to tell the difference at
> > > > unmap time.
> > > >
> > > > [1] 
> > > > https://lists.nongnu.org/archive/html/qemu-devel/2024-04/msg00079.html
> > > >
> > > > Thanks!
> > >
> > > Ok, I was wondering if we need to store GPA(GIOVA) to HVA mappings in
> > > the iova tree to solve this issue completely. Then there won't be
> > > aliasing issues.
> > >
> >
> > I'm ok to explore that route but this has another problem. Both SVQ
> > vrings and CVQ buffers also need to be addressable by VhostIOVATree,
> > and they do not have GPA.
> >
> > At this moment vhost_svq_translate_addr is able to handle this
> > transparently as we translate vaddr to SVQ IOVA. How can we store
> > these new entries? Maybe a (hwaddr)-1 GPA to signal it has no GPA and
> > then a list to go through other entries (SVQ vaddr and CVQ buffers).
>
> This seems to be tricky.
>
> As discussed, it could be another iova tree.
>

Yes, but there are many ways to add another IOVATree. Let me expand & recap.

Option 1 is to simply add another iova tree to VhostShadowVirtqueue.
Let's call it gpa_iova_tree, as opposed to the current iova_tree that
translates from vaddr to SVQ IOVA. Knowing which one to use is easy when
adding or removing entries, as in the memory listener, but how do we
know which one to use in vhost_svq_translate_addr?

The easiest way for me is to rely on memory_region_from_host(). When
vaddr is from the guest, it returns a valid MemoryRegion; when it is
not, it returns NULL. I'm not sure if this is a valid use case, but it
has worked in my tests so far.
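A toy sketch of that dispatch, with a predicate standing in for "memory_region_from_host() returned non-NULL". The ToyMap type and the is_guest_vaddr()/toy_lookup()/translate() names are made up for illustration; this is not QEMU code and guest memory is modelled as a single window:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t hwaddr;

/* Toy map entry: one vaddr range translated to an SVQ IOVA. */
typedef struct {
    uintptr_t vaddr_start, vaddr_end; /* [start, end) */
    hwaddr iova;
} ToyMap;

/* Stand-in for memory_region_from_host(): true only when vaddr lies
 * inside what we model as guest memory, a single [lo, hi) window. */
static bool is_guest_vaddr(uintptr_t vaddr, uintptr_t lo, uintptr_t hi)
{
    return vaddr >= lo && vaddr < hi;
}

static const ToyMap *toy_lookup(const ToyMap *tree, size_t n, uintptr_t vaddr)
{
    for (size_t i = 0; i < n; i++) {
        if (vaddr >= tree[i].vaddr_start && vaddr < tree[i].vaddr_end) {
            return &tree[i];
        }
    }
    return NULL;
}

/* The dispatch vhost_svq_translate_addr() would do under option 1:
 * guest vaddr -> gpa_iova_tree, host-only vaddr -> the current tree. */
static const ToyMap *translate(uintptr_t vaddr,
                               uintptr_t guest_lo, uintptr_t guest_hi,
                               const ToyMap *gpa_tree, size_t n_gpa,
                               const ToyMap *host_tree, size_t n_host)
{
    if (is_guest_vaddr(vaddr, guest_lo, guest_hi)) {
        return toy_lookup(gpa_tree, n_gpa, vaddr);
    }
    return toy_lookup(host_tree, n_host, vaddr);
}
```

The point of the sketch is only the branch: pick the tree first, then search it.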

Now we have the second problem: the GPA values of the regions of the
two IOVA trees must be unique. We need to be able to find unallocated
regions in SVQ IOVA. At this moment there is only one IOVATree, so
this is done easily by vhost_iova_tree_map_alloc. But it becomes very
complicated with two trees.

Option 2a is to add another IOVATree in VhostIOVATree. I think the
easiest way is to keep the GPA -> SVQ IOVA in one tree, let's call it
iova_gpa_map, and the current vaddr -> SVQ IOVA tree in
iova_taddr_map. This second tree should contain both vaddr memory that
belongs to the guest and host-only vaddr

Re: [RFC 1/2] iova_tree: add an id member to DMAMap

2024-05-08 Thread Eugenio Perez Martin
On Wed, May 8, 2024 at 2:52 AM Si-Wei Liu  wrote:
>
>
>
> On 5/1/2024 11:44 PM, Eugenio Perez Martin wrote:
> > On Thu, May 2, 2024 at 1:16 AM Si-Wei Liu  wrote:
> >>
> >>
> >> On 4/30/2024 10:19 AM, Eugenio Perez Martin wrote:
> >>> On Tue, Apr 30, 2024 at 7:55 AM Si-Wei Liu  wrote:
> >>>>
> >>>> On 4/29/2024 1:14 AM, Eugenio Perez Martin wrote:
> >>>>> On Thu, Apr 25, 2024 at 7:44 PM Si-Wei Liu  
> >>>>> wrote:
> >>>>>> On 4/24/2024 12:33 AM, Eugenio Perez Martin wrote:
> >>>>>>> On Wed, Apr 24, 2024 at 12:21 AM Si-Wei Liu  
> >>>>>>> wrote:
> >>>>>>>> On 4/22/2024 1:49 AM, Eugenio Perez Martin wrote:
> >>>>>>>>> On Sat, Apr 20, 2024 at 1:50 AM Si-Wei Liu  
> >>>>>>>>> wrote:
> >>>>>>>>>> On 4/19/2024 1:29 AM, Eugenio Perez Martin wrote:
> >>>>>>>>>>> On Thu, Apr 18, 2024 at 10:46 PM Si-Wei Liu 
> >>>>>>>>>>>  wrote:
> >>>>>>>>>>>> On 4/10/2024 3:03 AM, Eugenio Pérez wrote:
> >>>>>>>>>>>>> IOVA tree is also used to track the mappings of virtio-net 
> >>>>>>>>>>>>> shadow
> >>>>>>>>>>>>> virtqueue.  This mappings may not match with the GPA->HVA ones.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> This causes a problem when overlapped regions (different GPA 
> >>>>>>>>>>>>> but same
> >>>>>>>>>>>>> translated HVA) exists in the tree, as looking them by HVA will 
> >>>>>>>>>>>>> return
> >>>>>>>>>>>>> them twice.  To solve this, create an id member so we can 
> >>>>>>>>>>>>> assign unique
> >>>>>>>>>>>>> identifiers (GPA) to the maps.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Signed-off-by: Eugenio Pérez 
> >>>>>>>>>>>>> ---
> >>>>>>>>>>>>> include/qemu/iova-tree.h | 5 +++--
> >>>>>>>>>>>>> util/iova-tree.c | 3 ++-
> >>>>>>>>>>>>> 2 files changed, 5 insertions(+), 3 deletions(-)
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> diff --git a/include/qemu/iova-tree.h b/include/qemu/iova-tree.h
> >>>>>>>>>>>>> index 2a10a7052e..34ee230e7d 100644
> >>>>>>>>>>>>> --- a/include/qemu/iova-tree.h
> >>>>>>>>>>>>> +++ b/include/qemu/iova-tree.h
> >>>>>>>>>>>>> @@ -36,6 +36,7 @@ typedef struct DMAMap {
> >>>>>>>>>>>>> hwaddr iova;
> >>>>>>>>>>>>> hwaddr translated_addr;
> >>>>>>>>>>>>> hwaddr size;/* Inclusive */
> >>>>>>>>>>>>> +uint64_t id;
> >>>>>>>>>>>>> IOMMUAccessFlags perm;
> >>>>>>>>>>>>> } QEMU_PACKED DMAMap;
> >>>>>>>>>>>>> typedef gboolean (*iova_tree_iterator)(DMAMap *map);
> >>>>>>>>>>>>> @@ -100,8 +101,8 @@ const DMAMap *iova_tree_find(const IOVATree 
> >>>>>>>>>>>>> *tree, const DMAMap *map);
> >>>>>>>>>>>>>  * @map: the mapping to search
> >>>>>>>>>>>>>  *
> >>>>>>>>>>>>>  * Search for a mapping in the iova tree that 
> >>>>>>>>>>>>> translated_addr overlaps with the
> >>>>>>>>>>>>> - * mapping range specified.  Only the first found mapping will 
> >>>>>>>>>>>>> be
> >>>>>>>>>>>>> - * returned.
> >>>>>>>>>>>>> + * mapping range specified an

Re: [PATCH] hw/virtio: Fix obtain the buffer id from the last descriptor

2024-05-08 Thread Eugenio Perez Martin
On Mon, Apr 22, 2024 at 3:41 AM Wafer  wrote:
>
> The virtio-1.3 specification writes:
> 2.8.6 Next Flag: Descriptor Chaining
>   Buffer ID is included in the last descriptor in the list.
>
> If the feature (_F_INDIRECT_DESC) has been negotiated, install only
> one descriptor in the virtqueue.
> Therefore the buffer id should be obtained from the first descriptor.
>
> In descriptor chaining scenarios, the buffer id should be obtained
> from the last descriptor.
>

This is actually trickier. While it is true that the standard mandates
it, both the Linux virtio_ring driver and QEMU trust that the ID will be
in the first descriptor of the chain. Does merging this change in QEMU
without merging the corresponding one in the Linux kernel break things?
Or am I missing something?

If it breaks, I guess this requires more thinking. I didn't check DPDK,
either as a driver or as a vhost-user device.

Thanks!
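The spec rule under discussion can be modelled in a few lines. This is a toy, not QEMU's virtqueue_packed_pop(): the struct follows the packed descriptor layout from the virtio spec, but struct toy_packed_desc and chain_buffer_id() are made-up names, and a single-element chain stands in for the indirect-descriptor case:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Flags bit 0 per the virtio spec: more descriptors follow in the chain. */
#define VRING_DESC_F_NEXT (1u << 0)

/* Toy packed-ring descriptor with the spec's field layout. */
struct toy_packed_desc {
    uint64_t addr;
    uint32_t len;
    uint16_t id;
    uint16_t flags;
};

/* What 2.8.6 mandates: walk to the LAST descriptor of the chain and read
 * the buffer ID there, instead of trusting the first one. */
static uint16_t chain_buffer_id(const struct toy_packed_desc *desc, size_t max)
{
    size_t i = 0;
    while (i + 1 < max && (desc[i].flags & VRING_DESC_F_NEXT)) {
        i++;
    }
    return desc[i].id;
}
```

For a one-descriptor chain (the indirect case) the first and the last descriptor coincide, which is why both readings agree there and the difference only shows up with chaining.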

> Fixes: 86044b24e8 ("virtio: basic packed virtqueue support")
>
> Signed-off-by: Wafer 
> ---
>  hw/virtio/virtio.c | 5 +
>  1 file changed, 5 insertions(+)
>
> diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
> index 871674f9be..f65d4b4161 100644
> --- a/hw/virtio/virtio.c
> +++ b/hw/virtio/virtio.c
> @@ -1739,6 +1739,11 @@ static void *virtqueue_packed_pop(VirtQueue *vq, 
> size_t sz)
>  goto err_undo_map;
>  }
>
> +if (desc_cache != &indirect_desc_cache) {
> +/* Buffer ID is included in the last descriptor in the list. */
> +id = desc.id;
> +}
> +
>  rc = virtqueue_packed_read_next_desc(vq, &desc, desc_cache, max, &i,
>   desc_cache ==
>   &indirect_desc_cache);
> --
> 2.27.0
>




Re: [RFC 0/2] Identify aliased maps in vdpa SVQ iova_tree

2024-05-07 Thread Eugenio Perez Martin
On Tue, May 7, 2024 at 9:29 AM Jason Wang  wrote:
>
> On Fri, Apr 12, 2024 at 3:56 PM Eugenio Perez Martin
>  wrote:
> >
> > On Fri, Apr 12, 2024 at 8:47 AM Jason Wang  wrote:
> > >
> > > On Wed, Apr 10, 2024 at 6:03 PM Eugenio Pérez  wrote:
> > > >
> > > > The guest may have overlapped memory regions, where different GPA leads
> > > > to the same HVA.  This causes a problem when overlapped regions
> > > > (different GPA but same translated HVA) exists in the tree, as looking
> > > > them by HVA will return them twice.
> > >
> > > I think I don't understand if there's any side effect for shadow 
> > > virtqueue?
> > >
> >
> > My bad, I totally forgot to put a reference to where this comes from.
> >
> > Si-Wei found that during initialization this sequence of maps /
> > unmaps happens [1]:
> >
> > HVA                          GPA                      IOVA
> > ----------------------------------------------------------------
> > Map
> > [0x7f7903e0, 0x7f7983e0)     [0x0, 0x8000)            [0x1000, 0x8000)
> > [0x7f7983e0, 0x7f9903e0)     [0x1, 0x208000)          [0x80001000, 0x201000)
> > [0x7f7903ea, 0x7f7903ec)     [0xfeda, 0xfedc)         [0x201000, 0x221000)
> >
> > Unmap
> > [0x7f7903ea, 0x7f7903ec)     [0xfeda, 0xfedc)         [0x1000, 0x2) ???
> >
> > The third HVA range is contained in the first one, but exposed under a
> > different GVA (aliased). This is not "flattened" by QEMU, as GPA does
> > not overlap, only HVA.
> >
> > At the third chunk unmap, the current algorithm finds the first chunk,
> > not the second one. This series is the way to tell the difference at
> > unmap time.
> >
> > [1] https://lists.nongnu.org/archive/html/qemu-devel/2024-04/msg00079.html
> >
> > Thanks!
>
> Ok, I was wondering if we need to store GPA(GIOVA) to HVA mappings in
> the iova tree to solve this issue completely. Then there won't be
> aliasing issues.
>

I'm ok to explore that route but this has another problem. Both SVQ
vrings and CVQ buffers also need to be addressable by VhostIOVATree,
and they do not have GPA.

At this moment vhost_svq_translate_addr is able to handle this
transparently as we translate vaddr to SVQ IOVA. How can we store
these new entries? Maybe a (hwaddr)-1 GPA to signal it has no GPA and
then a list to go through other entries (SVQ vaddr and CVQ buffers).
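A self-contained sketch of the idea in this series plus the (hwaddr)-1 suggestion above: an id (the GPA, or a sentinel for GPA-less SVQ/CVQ buffers) breaks ties between aliased HVA ranges. ToyDMAMap, NO_GPA and hva_id_match() are illustrative names, not the real QEMU types; only the overlap test mirrors iova_tree_find_address_iterator():

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t hwaddr;
#define NO_GPA ((hwaddr)-1) /* suggested sentinel for host-only buffers */

/* Toy DMAMap with the id member this series adds. */
typedef struct {
    hwaddr iova;
    hwaddr translated_addr;
    hwaddr size;   /* inclusive, as in util/iova-tree.c */
    uint64_t id;   /* GPA, or NO_GPA */
} ToyDMAMap;

/* Match test modelled on iova_tree_find_address_iterator() with the
 * patch applied: HVA overlap alone is no longer enough, ids must match. */
static bool hva_id_match(const ToyDMAMap *needle, const ToyDMAMap *map)
{
    if (map->translated_addr + map->size < needle->translated_addr ||
        needle->translated_addr + needle->size < map->translated_addr) {
        return false; /* HVA ranges do not even overlap */
    }
    return needle->id == map->id;
}
```

With this, the unmap of the aliased third chunk can no longer match the first chunk, since their GPAs (ids) differ even though the HVA ranges overlap.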

Thanks!

> Thanks
>
> >
> > > Thanks
> > >
> > > >
> > > > To solve this, track the GPA in the DMA entry so it acts as a unique
> > > > identifier for the maps.  When a map needs to be removed, the iova
> > > > tree is able to find the right one.
> > > >
> > > > Users that do not go through this extra layer of indirection can use
> > > > the iova tree as usual, with id = 0.
> > > >
> > > > This was found by Si-Wei Liu, but I'm having a hard time
> > > > reproducing the issue.  This has been tested only without
> > > > overlapping maps.  If it works with overlapping maps, it will be
> > > > integrated in the main series.
> > > >
> > > > Comments are welcome.  Thanks!
> > > >
> > > > Eugenio Pérez (2):
> > > >   iova_tree: add an id member to DMAMap
> > > >   vdpa: identify aliased maps in iova_tree
> > > >
> > > >  hw/virtio/vhost-vdpa.c   | 2 ++
> > > >  include/qemu/iova-tree.h | 5 +++--
> > > >  util/iova-tree.c | 3 ++-
> > > >  3 files changed, 7 insertions(+), 3 deletions(-)
> > > >
> > > > --
> > > > 2.44.0
> > > >
> > >
> >
>




Re: Intention to work on GSoC project

2024-05-07 Thread Eugenio Perez Martin
On Mon, May 6, 2024 at 9:00 PM Sahil  wrote:
>
> Hi,
>
> It's been a while since I last gave an update. Sorry about that. I am ready
> to get my hands dirty and start with the implementation.
>

No worries!

> I have gone through the source of linux's drivers/virtio/virtio_ring.c [1], 
> and
> QEMU's hw/virtio/virtio.c [2] and hw/virtio/vhost-shadow-virtqueue.c [3].
>
> Before actually starting I would like to make sure I am on the right track. In
> vhost-shadow-virtqueue.c, there's a function "vhost_svq_add" which in turn
> calls "vhost_svq_add_split".
>
> Shall I start by implementing a mechanism to check if the feature bit
> "VIRTIO_F_RING_PACKED" is set (using "virtio_vdev_has_feature")? And
> if it's supported, "vhost_svq_add" should call "vhost_svq_add_packed".
> Following this, I can then start implementing "vhost_svq_add_packed"
> and progress from there.
>
> What are your thoughts on this?
>

Yes, that's totally right.

I recommend you to also disable _F_EVENT_IDX to start, so the first
version is easier.

Also, you can send as many incomplete RFCs as you want. For example,
you can send a first version that only implements reading of the guest
avail ring, so we know we're aligned on that. Then, we can send
subsequent RFCs adding features on top.

Does that make sense to you?

Thanks!
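The suggested starting point can be sketched as a plain feature-bit dispatch. This is toy code: the real vhost_svq_add() would call virtio_vdev_has_feature() on the VirtIODevice, while has_feature() and svq_add_flavor() here are made-up stand-ins that just return the name of the flavor that would be taken:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Feature bit number from the virtio spec. */
#define VIRTIO_F_RING_PACKED 34

/* Toy stand-in for virtio_vdev_has_feature(): test one bit in a 64-bit
 * feature word. */
static bool has_feature(uint64_t features, unsigned bit)
{
    return (features >> bit) & 1;
}

/* Dispatch sketch for vhost_svq_add(): pick the packed or split flavor. */
static const char *svq_add_flavor(uint64_t guest_features)
{
    return has_feature(guest_features, VIRTIO_F_RING_PACKED)
               ? "vhost_svq_add_packed"
               : "vhost_svq_add_split";
}
```
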

> Thanks,
> Sahil
>
> [1] 
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/virtio/virtio.c
> [2] https://gitlab.com/qemu-project/qemu/-/blob/master/hw/virtio/virtio.c
> [3] 
> https://gitlab.com/qemu-project/qemu/-/blob/master/hw/virtio/vhost-shadow-virtqueue.c
>
>




Re: [RFC 1/2] iova_tree: add an id member to DMAMap

2024-05-01 Thread Eugenio Perez Martin
On Thu, May 2, 2024 at 1:16 AM Si-Wei Liu  wrote:
>
>
>
> On 4/30/2024 10:19 AM, Eugenio Perez Martin wrote:
> > On Tue, Apr 30, 2024 at 7:55 AM Si-Wei Liu  wrote:
> >>
> >>
> >> On 4/29/2024 1:14 AM, Eugenio Perez Martin wrote:
> >>> On Thu, Apr 25, 2024 at 7:44 PM Si-Wei Liu  wrote:
> >>>>
> >>>> On 4/24/2024 12:33 AM, Eugenio Perez Martin wrote:
> >>>>> On Wed, Apr 24, 2024 at 12:21 AM Si-Wei Liu  
> >>>>> wrote:
> >>>>>> On 4/22/2024 1:49 AM, Eugenio Perez Martin wrote:
> >>>>>>> On Sat, Apr 20, 2024 at 1:50 AM Si-Wei Liu  
> >>>>>>> wrote:
> >>>>>>>> On 4/19/2024 1:29 AM, Eugenio Perez Martin wrote:
> >>>>>>>>> On Thu, Apr 18, 2024 at 10:46 PM Si-Wei Liu  
> >>>>>>>>> wrote:
> >>>>>>>>>> On 4/10/2024 3:03 AM, Eugenio Pérez wrote:
> >>>>>>>>>>> IOVA tree is also used to track the mappings of virtio-net shadow
> >>>>>>>>>>> virtqueue.  This mappings may not match with the GPA->HVA ones.
> >>>>>>>>>>>
> >>>>>>>>>>> This causes a problem when overlapped regions (different GPA but 
> >>>>>>>>>>> same
> >>>>>>>>>>> translated HVA) exists in the tree, as looking them by HVA will 
> >>>>>>>>>>> return
> >>>>>>>>>>> them twice.  To solve this, create an id member so we can assign 
> >>>>>>>>>>> unique
> >>>>>>>>>>> identifiers (GPA) to the maps.
> >>>>>>>>>>>
> >>>>>>>>>>> Signed-off-by: Eugenio Pérez 
> >>>>>>>>>>> ---
> >>>>>>>>>>>include/qemu/iova-tree.h | 5 +++--
> >>>>>>>>>>>util/iova-tree.c | 3 ++-
> >>>>>>>>>>>2 files changed, 5 insertions(+), 3 deletions(-)
> >>>>>>>>>>>
> >>>>>>>>>>> diff --git a/include/qemu/iova-tree.h b/include/qemu/iova-tree.h
> >>>>>>>>>>> index 2a10a7052e..34ee230e7d 100644
> >>>>>>>>>>> --- a/include/qemu/iova-tree.h
> >>>>>>>>>>> +++ b/include/qemu/iova-tree.h
> >>>>>>>>>>> @@ -36,6 +36,7 @@ typedef struct DMAMap {
> >>>>>>>>>>>hwaddr iova;
> >>>>>>>>>>>hwaddr translated_addr;
> >>>>>>>>>>>hwaddr size;/* Inclusive */
> >>>>>>>>>>> +uint64_t id;
> >>>>>>>>>>>IOMMUAccessFlags perm;
> >>>>>>>>>>>} QEMU_PACKED DMAMap;
> >>>>>>>>>>>typedef gboolean (*iova_tree_iterator)(DMAMap *map);
> >>>>>>>>>>> @@ -100,8 +101,8 @@ const DMAMap *iova_tree_find(const IOVATree 
> >>>>>>>>>>> *tree, const DMAMap *map);
> >>>>>>>>>>> * @map: the mapping to search
> >>>>>>>>>>> *
> >>>>>>>>>>> * Search for a mapping in the iova tree that 
> >>>>>>>>>>> translated_addr overlaps with the
> >>>>>>>>>>> - * mapping range specified.  Only the first found mapping will be
> >>>>>>>>>>> - * returned.
> >>>>>>>>>>> + * mapping range specified and map->id is equal.  Only the first 
> >>>>>>>>>>> found
> >>>>>>>>>>> + * mapping will be returned.
> >>>>>>>>>>> *
> >>>>>>>>>>> * Return: DMAMap pointer if found, or NULL if not found.  
> >>>>>>>>>>> Note that
> >>>>>>>>>>> * the returned DMAMap pointer is maintained internally.  
> >>>>>>>>>>> User should
> >>>>>>>>>>> diff --git a/util/iova-tree.c b/util/io

Re: [RFC 1/2] iova_tree: add an id member to DMAMap

2024-05-01 Thread Eugenio Perez Martin
On Thu, May 2, 2024 at 12:09 AM Si-Wei Liu  wrote:
>
>
>
> On 4/30/2024 11:11 AM, Eugenio Perez Martin wrote:
> > On Mon, Apr 29, 2024 at 1:19 PM Jonah Palmer  
> > wrote:
> >>
> >>
> >> On 4/29/24 4:14 AM, Eugenio Perez Martin wrote:
> >>> On Thu, Apr 25, 2024 at 7:44 PM Si-Wei Liu  wrote:
> >>>>
> >>>>
> >>>> On 4/24/2024 12:33 AM, Eugenio Perez Martin wrote:
> >>>>> On Wed, Apr 24, 2024 at 12:21 AM Si-Wei Liu  
> >>>>> wrote:
> >>>>>>
> >>>>>> On 4/22/2024 1:49 AM, Eugenio Perez Martin wrote:
> >>>>>>> On Sat, Apr 20, 2024 at 1:50 AM Si-Wei Liu  
> >>>>>>> wrote:
> >>>>>>>> On 4/19/2024 1:29 AM, Eugenio Perez Martin wrote:
> >>>>>>>>> On Thu, Apr 18, 2024 at 10:46 PM Si-Wei Liu  
> >>>>>>>>> wrote:
> >>>>>>>>>> On 4/10/2024 3:03 AM, Eugenio Pérez wrote:
> >>>>>>>>>>> IOVA tree is also used to track the mappings of virtio-net shadow
> >>>>>>>>>>> virtqueue.  This mappings may not match with the GPA->HVA ones.
> >>>>>>>>>>>
> >>>>>>>>>>> This causes a problem when overlapped regions (different GPA but 
> >>>>>>>>>>> same
> >>>>>>>>>>> translated HVA) exists in the tree, as looking them by HVA will 
> >>>>>>>>>>> return
> >>>>>>>>>>> them twice.  To solve this, create an id member so we can assign 
> >>>>>>>>>>> unique
> >>>>>>>>>>> identifiers (GPA) to the maps.
> >>>>>>>>>>>
> >>>>>>>>>>> Signed-off-by: Eugenio Pérez 
> >>>>>>>>>>> ---
> >>>>>>>>>>>include/qemu/iova-tree.h | 5 +++--
> >>>>>>>>>>>util/iova-tree.c | 3 ++-
> >>>>>>>>>>>2 files changed, 5 insertions(+), 3 deletions(-)
> >>>>>>>>>>>
> >>>>>>>>>>> diff --git a/include/qemu/iova-tree.h b/include/qemu/iova-tree.h
> >>>>>>>>>>> index 2a10a7052e..34ee230e7d 100644
> >>>>>>>>>>> --- a/include/qemu/iova-tree.h
> >>>>>>>>>>> +++ b/include/qemu/iova-tree.h
> >>>>>>>>>>> @@ -36,6 +36,7 @@ typedef struct DMAMap {
> >>>>>>>>>>>hwaddr iova;
> >>>>>>>>>>>hwaddr translated_addr;
> >>>>>>>>>>>hwaddr size;/* Inclusive */
> >>>>>>>>>>> +uint64_t id;
> >>>>>>>>>>>IOMMUAccessFlags perm;
> >>>>>>>>>>>} QEMU_PACKED DMAMap;
> >>>>>>>>>>>typedef gboolean (*iova_tree_iterator)(DMAMap *map);
> >>>>>>>>>>> @@ -100,8 +101,8 @@ const DMAMap *iova_tree_find(const IOVATree 
> >>>>>>>>>>> *tree, const DMAMap *map);
> >>>>>>>>>>> * @map: the mapping to search
> >>>>>>>>>>> *
> >>>>>>>>>>> * Search for a mapping in the iova tree that 
> >>>>>>>>>>> translated_addr overlaps with the
> >>>>>>>>>>> - * mapping range specified.  Only the first found mapping will be
> >>>>>>>>>>> - * returned.
> >>>>>>>>>>> + * mapping range specified and map->id is equal.  Only the first 
> >>>>>>>>>>> found
> >>>>>>>>>>> + * mapping will be returned.
> >>>>>>>>>>> *
> >>>>>>>>>>> * Return: DMAMap pointer if found, or NULL if not found.  
> >>>>>>>>>>> Note that
> >>>>>>>>>>> * the returned DMAMap pointer is maintained internally.  
> >>>>>>>>>>> User should
> >>>>>>>

Re: [RFC 1/2] iova_tree: add an id member to DMAMap

2024-04-30 Thread Eugenio Perez Martin
On Mon, Apr 29, 2024 at 1:19 PM Jonah Palmer  wrote:
>
>
>
> On 4/29/24 4:14 AM, Eugenio Perez Martin wrote:
> > On Thu, Apr 25, 2024 at 7:44 PM Si-Wei Liu  wrote:
> >>
> >>
> >>
> >> On 4/24/2024 12:33 AM, Eugenio Perez Martin wrote:
> >>> On Wed, Apr 24, 2024 at 12:21 AM Si-Wei Liu  wrote:
> >>>>
> >>>>
> >>>> On 4/22/2024 1:49 AM, Eugenio Perez Martin wrote:
> >>>>> On Sat, Apr 20, 2024 at 1:50 AM Si-Wei Liu  
> >>>>> wrote:
> >>>>>>
> >>>>>> On 4/19/2024 1:29 AM, Eugenio Perez Martin wrote:
> >>>>>>> On Thu, Apr 18, 2024 at 10:46 PM Si-Wei Liu  
> >>>>>>> wrote:
> >>>>>>>> On 4/10/2024 3:03 AM, Eugenio Pérez wrote:
> >>>>>>>>> IOVA tree is also used to track the mappings of virtio-net shadow
> >>>>>>>>> virtqueue.  This mappings may not match with the GPA->HVA ones.
> >>>>>>>>>
> >>>>>>>>> This causes a problem when overlapped regions (different GPA but 
> >>>>>>>>> same
> >>>>>>>>> translated HVA) exists in the tree, as looking them by HVA will 
> >>>>>>>>> return
> >>>>>>>>> them twice.  To solve this, create an id member so we can assign 
> >>>>>>>>> unique
> >>>>>>>>> identifiers (GPA) to the maps.
> >>>>>>>>>
> >>>>>>>>> Signed-off-by: Eugenio Pérez 
> >>>>>>>>> ---
> >>>>>>>>>   include/qemu/iova-tree.h | 5 +++--
> >>>>>>>>>   util/iova-tree.c | 3 ++-
> >>>>>>>>>   2 files changed, 5 insertions(+), 3 deletions(-)
> >>>>>>>>>
> >>>>>>>>> diff --git a/include/qemu/iova-tree.h b/include/qemu/iova-tree.h
> >>>>>>>>> index 2a10a7052e..34ee230e7d 100644
> >>>>>>>>> --- a/include/qemu/iova-tree.h
> >>>>>>>>> +++ b/include/qemu/iova-tree.h
> >>>>>>>>> @@ -36,6 +36,7 @@ typedef struct DMAMap {
> >>>>>>>>>   hwaddr iova;
> >>>>>>>>>   hwaddr translated_addr;
> >>>>>>>>>   hwaddr size;/* Inclusive */
> >>>>>>>>> +uint64_t id;
> >>>>>>>>>   IOMMUAccessFlags perm;
> >>>>>>>>>   } QEMU_PACKED DMAMap;
> >>>>>>>>>   typedef gboolean (*iova_tree_iterator)(DMAMap *map);
> >>>>>>>>> @@ -100,8 +101,8 @@ const DMAMap *iova_tree_find(const IOVATree 
> >>>>>>>>> *tree, const DMAMap *map);
> >>>>>>>>>* @map: the mapping to search
> >>>>>>>>>*
> >>>>>>>>>* Search for a mapping in the iova tree that translated_addr 
> >>>>>>>>> overlaps with the
> >>>>>>>>> - * mapping range specified.  Only the first found mapping will be
> >>>>>>>>> - * returned.
> >>>>>>>>> + * mapping range specified and map->id is equal.  Only the first 
> >>>>>>>>> found
> >>>>>>>>> + * mapping will be returned.
> >>>>>>>>>*
> >>>>>>>>>* Return: DMAMap pointer if found, or NULL if not found.  
> >>>>>>>>> Note that
> >>>>>>>>>* the returned DMAMap pointer is maintained internally.  
> >>>>>>>>> User should
> >>>>>>>>> diff --git a/util/iova-tree.c b/util/iova-tree.c
> >>>>>>>>> index 536789797e..0863e0a3b8 100644
> >>>>>>>>> --- a/util/iova-tree.c
> >>>>>>>>> +++ b/util/iova-tree.c
> >>>>>>>>> @@ -97,7 +97,8 @@ static gboolean 
> >>>>>>>>> iova_tree_find_address_iterator(gpointer key, gpointer value,
> >>>>>>>>>
> >>>>>>>>>   needle = args->needle;

Re: [RFC 1/2] iova_tree: add an id member to DMAMap

2024-04-30 Thread Eugenio Perez Martin
On Tue, Apr 30, 2024 at 7:55 AM Si-Wei Liu  wrote:
>
>
>
> On 4/29/2024 1:14 AM, Eugenio Perez Martin wrote:
> > On Thu, Apr 25, 2024 at 7:44 PM Si-Wei Liu  wrote:
> >>
> >>
> >> On 4/24/2024 12:33 AM, Eugenio Perez Martin wrote:
> >>> On Wed, Apr 24, 2024 at 12:21 AM Si-Wei Liu  wrote:
> >>>>
> >>>> On 4/22/2024 1:49 AM, Eugenio Perez Martin wrote:
> >>>>> On Sat, Apr 20, 2024 at 1:50 AM Si-Wei Liu  
> >>>>> wrote:
> >>>>>> On 4/19/2024 1:29 AM, Eugenio Perez Martin wrote:
> >>>>>>> On Thu, Apr 18, 2024 at 10:46 PM Si-Wei Liu  
> >>>>>>> wrote:
> >>>>>>>> On 4/10/2024 3:03 AM, Eugenio Pérez wrote:
> >>>>>>>>> IOVA tree is also used to track the mappings of virtio-net shadow
> >>>>>>>>> virtqueue.  This mappings may not match with the GPA->HVA ones.
> >>>>>>>>>
> >>>>>>>>> This causes a problem when overlapped regions (different GPA but 
> >>>>>>>>> same
> >>>>>>>>> translated HVA) exists in the tree, as looking them by HVA will 
> >>>>>>>>> return
> >>>>>>>>> them twice.  To solve this, create an id member so we can assign 
> >>>>>>>>> unique
> >>>>>>>>> identifiers (GPA) to the maps.
> >>>>>>>>>
> >>>>>>>>> Signed-off-by: Eugenio Pérez 
> >>>>>>>>> ---
> >>>>>>>>>   include/qemu/iova-tree.h | 5 +++--
> >>>>>>>>>   util/iova-tree.c | 3 ++-
> >>>>>>>>>   2 files changed, 5 insertions(+), 3 deletions(-)
> >>>>>>>>>
> >>>>>>>>> diff --git a/include/qemu/iova-tree.h b/include/qemu/iova-tree.h
> >>>>>>>>> index 2a10a7052e..34ee230e7d 100644
> >>>>>>>>> --- a/include/qemu/iova-tree.h
> >>>>>>>>> +++ b/include/qemu/iova-tree.h
> >>>>>>>>> @@ -36,6 +36,7 @@ typedef struct DMAMap {
> >>>>>>>>>   hwaddr iova;
> >>>>>>>>>   hwaddr translated_addr;
> >>>>>>>>>   hwaddr size;/* Inclusive */
> >>>>>>>>> +uint64_t id;
> >>>>>>>>>   IOMMUAccessFlags perm;
> >>>>>>>>>   } QEMU_PACKED DMAMap;
> >>>>>>>>>   typedef gboolean (*iova_tree_iterator)(DMAMap *map);
> >>>>>>>>> @@ -100,8 +101,8 @@ const DMAMap *iova_tree_find(const IOVATree 
> >>>>>>>>> *tree, const DMAMap *map);
> >>>>>>>>>* @map: the mapping to search
> >>>>>>>>>*
> >>>>>>>>>* Search for a mapping in the iova tree that translated_addr 
> >>>>>>>>> overlaps with the
> >>>>>>>>> - * mapping range specified.  Only the first found mapping will be
> >>>>>>>>> - * returned.
> >>>>>>>>> + * mapping range specified and map->id is equal.  Only the first 
> >>>>>>>>> found
> >>>>>>>>> + * mapping will be returned.
> >>>>>>>>>*
> >>>>>>>>>* Return: DMAMap pointer if found, or NULL if not found.  
> >>>>>>>>> Note that
> >>>>>>>>>* the returned DMAMap pointer is maintained internally.  
> >>>>>>>>> User should
> >>>>>>>>> diff --git a/util/iova-tree.c b/util/iova-tree.c
> >>>>>>>>> index 536789797e..0863e0a3b8 100644
> >>>>>>>>> --- a/util/iova-tree.c
> >>>>>>>>> +++ b/util/iova-tree.c
> >>>>>>>>> @@ -97,7 +97,8 @@ static gboolean 
> >>>>>>>>> iova_tree_find_address_iterator(gpointer key, gpointer value,
> >>>>>>>>>
> >>>>>>>>>   needle = args->needle;
> >>>>>>>>>   if (map->translated_addr + map->size < 

Re: [RFC 1/2] iova_tree: add an id member to DMAMap

2024-04-29 Thread Eugenio Perez Martin
On Thu, Apr 25, 2024 at 7:44 PM Si-Wei Liu  wrote:
>
>
>
> On 4/24/2024 12:33 AM, Eugenio Perez Martin wrote:
> > On Wed, Apr 24, 2024 at 12:21 AM Si-Wei Liu  wrote:
> >>
> >>
> >> On 4/22/2024 1:49 AM, Eugenio Perez Martin wrote:
> >>> On Sat, Apr 20, 2024 at 1:50 AM Si-Wei Liu  wrote:
> >>>>
> >>>> On 4/19/2024 1:29 AM, Eugenio Perez Martin wrote:
> >>>>> On Thu, Apr 18, 2024 at 10:46 PM Si-Wei Liu  
> >>>>> wrote:
> >>>>>> On 4/10/2024 3:03 AM, Eugenio Pérez wrote:
> >>>>>>> IOVA tree is also used to track the mappings of virtio-net shadow
> >>>>>>> virtqueue.  This mappings may not match with the GPA->HVA ones.
> >>>>>>>
> >>>>>>> This causes a problem when overlapped regions (different GPA but same
> >>>>>>> translated HVA) exists in the tree, as looking them by HVA will return
> >>>>>>> them twice.  To solve this, create an id member so we can assign 
> >>>>>>> unique
> >>>>>>> identifiers (GPA) to the maps.
> >>>>>>>
> >>>>>>> Signed-off-by: Eugenio Pérez 
> >>>>>>> ---
> >>>>>>>  include/qemu/iova-tree.h | 5 +++--
> >>>>>>>  util/iova-tree.c | 3 ++-
> >>>>>>>  2 files changed, 5 insertions(+), 3 deletions(-)
> >>>>>>>
> >>>>>>> diff --git a/include/qemu/iova-tree.h b/include/qemu/iova-tree.h
> >>>>>>> index 2a10a7052e..34ee230e7d 100644
> >>>>>>> --- a/include/qemu/iova-tree.h
> >>>>>>> +++ b/include/qemu/iova-tree.h
> >>>>>>> @@ -36,6 +36,7 @@ typedef struct DMAMap {
> >>>>>>>  hwaddr iova;
> >>>>>>>  hwaddr translated_addr;
> >>>>>>>  hwaddr size;/* Inclusive */
> >>>>>>> +uint64_t id;
> >>>>>>>  IOMMUAccessFlags perm;
> >>>>>>>  } QEMU_PACKED DMAMap;
> >>>>>>>  typedef gboolean (*iova_tree_iterator)(DMAMap *map);
> >>>>>>> @@ -100,8 +101,8 @@ const DMAMap *iova_tree_find(const IOVATree 
> >>>>>>> *tree, const DMAMap *map);
> >>>>>>>   * @map: the mapping to search
> >>>>>>>   *
> >>>>>>>   * Search for a mapping in the iova tree that translated_addr 
> >>>>>>> overlaps with the
> >>>>>>> - * mapping range specified.  Only the first found mapping will be
> >>>>>>> - * returned.
> >>>>>>> + * mapping range specified and map->id is equal.  Only the first 
> >>>>>>> found
> >>>>>>> + * mapping will be returned.
> >>>>>>>   *
> >>>>>>>   * Return: DMAMap pointer if found, or NULL if not found.  Note 
> >>>>>>> that
> >>>>>>>   * the returned DMAMap pointer is maintained internally.  User 
> >>>>>>> should
> >>>>>>> diff --git a/util/iova-tree.c b/util/iova-tree.c
> >>>>>>> index 536789797e..0863e0a3b8 100644
> >>>>>>> --- a/util/iova-tree.c
> >>>>>>> +++ b/util/iova-tree.c
> >>>>>>> @@ -97,7 +97,8 @@ static gboolean 
> >>>>>>> iova_tree_find_address_iterator(gpointer key, gpointer value,
> >>>>>>>
> >>>>>>>  needle = args->needle;
> >>>>>>>  if (map->translated_addr + map->size < 
> >>>>>>> needle->translated_addr ||
> >>>>>>> -needle->translated_addr + needle->size < 
> >>>>>>> map->translated_addr) {
> >>>>>>> +needle->translated_addr + needle->size < 
> >>>>>>> map->translated_addr ||
> >>>>>>> +needle->id != map->id) {
> >>>>>> It looks this iterator can also be invoked by SVQ from
> >>>>>> vhost_svq_translate_addr() -> iova_tree_find_iova(), where guest GPA
> >>>>>> space will be searched on without passing in the ID (GPA), and exa

Re: [RFC 1/2] iova_tree: add an id member to DMAMap

2024-04-24 Thread Eugenio Perez Martin
On Wed, Apr 24, 2024 at 12:21 AM Si-Wei Liu  wrote:
>
>
>
> On 4/22/2024 1:49 AM, Eugenio Perez Martin wrote:
> > On Sat, Apr 20, 2024 at 1:50 AM Si-Wei Liu  wrote:
> >>
> >>
> >> On 4/19/2024 1:29 AM, Eugenio Perez Martin wrote:
> >>> On Thu, Apr 18, 2024 at 10:46 PM Si-Wei Liu  wrote:
> >>>>
> >>>> On 4/10/2024 3:03 AM, Eugenio Pérez wrote:
> >>>>> IOVA tree is also used to track the mappings of virtio-net shadow
> >>>>> virtqueue.  This mappings may not match with the GPA->HVA ones.
> >>>>>
> >>>>> This causes a problem when overlapped regions (different GPA but same
> >>>>> translated HVA) exists in the tree, as looking them by HVA will return
> >>>>> them twice.  To solve this, create an id member so we can assign unique
> >>>>> identifiers (GPA) to the maps.
> >>>>>
> >>>>> Signed-off-by: Eugenio Pérez 
> >>>>> ---
> >>>>> include/qemu/iova-tree.h | 5 +++--
> >>>>> util/iova-tree.c | 3 ++-
> >>>>> 2 files changed, 5 insertions(+), 3 deletions(-)
> >>>>>
> >>>>> diff --git a/include/qemu/iova-tree.h b/include/qemu/iova-tree.h
> >>>>> index 2a10a7052e..34ee230e7d 100644
> >>>>> --- a/include/qemu/iova-tree.h
> >>>>> +++ b/include/qemu/iova-tree.h
> >>>>> @@ -36,6 +36,7 @@ typedef struct DMAMap {
> >>>>> hwaddr iova;
> >>>>> hwaddr translated_addr;
> >>>>> hwaddr size;/* Inclusive */
> >>>>> +uint64_t id;
> >>>>> IOMMUAccessFlags perm;
> >>>>> } QEMU_PACKED DMAMap;
> >>>>> typedef gboolean (*iova_tree_iterator)(DMAMap *map);
> >>>>> @@ -100,8 +101,8 @@ const DMAMap *iova_tree_find(const IOVATree *tree, 
> >>>>> const DMAMap *map);
> >>>>>  * @map: the mapping to search
> >>>>>  *
> >>>>>  * Search for a mapping in the iova tree that translated_addr 
> >>>>> overlaps with the
> >>>>> - * mapping range specified.  Only the first found mapping will be
> >>>>> - * returned.
> >>>>> + * mapping range specified and map->id is equal.  Only the first found
> >>>>> + * mapping will be returned.
> >>>>>  *
> >>>>>  * Return: DMAMap pointer if found, or NULL if not found.  Note that
> >>>>>  * the returned DMAMap pointer is maintained internally.  User 
> >>>>> should
> >>>>> diff --git a/util/iova-tree.c b/util/iova-tree.c
> >>>>> index 536789797e..0863e0a3b8 100644
> >>>>> --- a/util/iova-tree.c
> >>>>> +++ b/util/iova-tree.c
> >>>>> @@ -97,7 +97,8 @@ static gboolean 
> >>>>> iova_tree_find_address_iterator(gpointer key, gpointer value,
> >>>>>
> >>>>> needle = args->needle;
> >>>>> if (map->translated_addr + map->size < needle->translated_addr 
> >>>>> ||
> >>>>> -needle->translated_addr + needle->size < map->translated_addr) 
> >>>>> {
> >>>>> +needle->translated_addr + needle->size < map->translated_addr 
> >>>>> ||
> >>>>> +needle->id != map->id) {
> >>>> It looks this iterator can also be invoked by SVQ from
> >>>> vhost_svq_translate_addr() -> iova_tree_find_iova(), where guest GPA
> >>>> space will be searched on without passing in the ID (GPA), and exact
> >>>> match for the same GPA range is not actually needed unlike the mapping
> >>>> removal case. Could we create an API variant, for the SVQ lookup case
> >>>> specifically? Or alternatively, add a special flag, say skip_id_match to
> >>>> DMAMap, and the id match check may look like below:
> >>>>
> >>>> (!needle->skip_id_match && needle->id != map->id)
> >>>>
> >>>> I think vhost_svq_translate_addr() could just call the API variant or
> >>>> pass DMAmap with skip_id_match set to true to svq_iova_tree_find_iova().
> >>>>
> >>> I think you're totally right. But I'd really 

Re: [RFC 1/2] iova_tree: add an id member to DMAMap

2024-04-22 Thread Eugenio Perez Martin
On Sat, Apr 20, 2024 at 1:50 AM Si-Wei Liu  wrote:
>
>
>
> On 4/19/2024 1:29 AM, Eugenio Perez Martin wrote:
> > On Thu, Apr 18, 2024 at 10:46 PM Si-Wei Liu  wrote:
> >>
> >>
> >> On 4/10/2024 3:03 AM, Eugenio Pérez wrote:
> >>> IOVA tree is also used to track the mappings of virtio-net shadow
> >>> virtqueue.  This mappings may not match with the GPA->HVA ones.
> >>>
> >>> This causes a problem when overlapped regions (different GPA but same
> >>> translated HVA) exists in the tree, as looking them by HVA will return
> >>> them twice.  To solve this, create an id member so we can assign unique
> >>> identifiers (GPA) to the maps.
> >>>
> >>> Signed-off-by: Eugenio Pérez 
> >>> ---
> >>>include/qemu/iova-tree.h | 5 +++--
> >>>util/iova-tree.c | 3 ++-
> >>>2 files changed, 5 insertions(+), 3 deletions(-)
> >>>
> >>> diff --git a/include/qemu/iova-tree.h b/include/qemu/iova-tree.h
> >>> index 2a10a7052e..34ee230e7d 100644
> >>> --- a/include/qemu/iova-tree.h
> >>> +++ b/include/qemu/iova-tree.h
> >>> @@ -36,6 +36,7 @@ typedef struct DMAMap {
> >>>hwaddr iova;
> >>>hwaddr translated_addr;
> >>>hwaddr size;/* Inclusive */
> >>> +uint64_t id;
> >>>IOMMUAccessFlags perm;
> >>>} QEMU_PACKED DMAMap;
> >>>typedef gboolean (*iova_tree_iterator)(DMAMap *map);
> >>> @@ -100,8 +101,8 @@ const DMAMap *iova_tree_find(const IOVATree *tree, 
> >>> const DMAMap *map);
> >>> * @map: the mapping to search
> >>> *
> >>> * Search for a mapping in the iova tree that translated_addr overlaps 
> >>> with the
> >>> - * mapping range specified.  Only the first found mapping will be
> >>> - * returned.
> >>> + * mapping range specified and map->id is equal.  Only the first found
> >>> + * mapping will be returned.
> >>> *
> >>> * Return: DMAMap pointer if found, or NULL if not found.  Note that
> >>> * the returned DMAMap pointer is maintained internally.  User should
> >>> diff --git a/util/iova-tree.c b/util/iova-tree.c
> >>> index 536789797e..0863e0a3b8 100644
> >>> --- a/util/iova-tree.c
> >>> +++ b/util/iova-tree.c
> >>> @@ -97,7 +97,8 @@ static gboolean 
> >>> iova_tree_find_address_iterator(gpointer key, gpointer value,
> >>>
> >>>needle = args->needle;
> >>>if (map->translated_addr + map->size < needle->translated_addr ||
> >>> -needle->translated_addr + needle->size < map->translated_addr) {
> >>> +needle->translated_addr + needle->size < map->translated_addr ||
> >>> +needle->id != map->id) {
> >> It looks like this iterator can also be invoked by SVQ from
> >> vhost_svq_translate_addr() -> iova_tree_find_iova(), where the guest GPA
> >> space will be searched without passing in the ID (GPA), and an exact
> >> match for the same GPA range is not actually needed, unlike in the mapping
> >> removal case. Could we create an API variant for the SVQ lookup case
> >> specifically? Or alternatively, add a special flag, say skip_id_match, to
> >> DMAMap, and the id match check may look like below:
> >>
> >> (!needle->skip_id_match && needle->id != map->id)
> >>
> >> I think vhost_svq_translate_addr() could just call the API variant or
> >> pass a DMAMap with skip_id_match set to true to svq_iova_tree_find_iova().
> >>
> > I think you're totally right. But I'd really like to not complicate
> > the API of the iova_tree more.
> >
> > I think we can look for the hwaddr using memory_region_from_host and
> > then get the hwaddr. It is another lookup though...
> Yeah, that will be another means of doing translation without having to
> complicate the API around iova_tree. I wonder how the lookup through
> memory_region_from_host() may perform compared to the iova tree one, the
> former looks to be an O(N) linear search on a linked list while the
> latter would be roughly O(log N) on an AVL tree?

Even worse: the reverse lookup (from QEMU vaddr to SVQ IOVA) is
linear too, and it is not even ordered.

But apart from that detail you're right: I have the same concerns about
this solution too. If we see a hard performance regression we could move
to more complicated solutions, like also maintaining a reverse IOVATree
in vhost-iova-tree. The first RFCs of SVQ actually did that.

Thanks!

> Of course,
> memory_region_from_host() won't search out of the guest memory space for
> sure. As this could be on the hot data path I have a little bit
> hesitance over the potential cost or performance regression this change
> could bring in, but maybe I'm overthinking it too much...
>
> Thanks,
> -Siwei
>
> >
> >> Thanks,
> >> -Siwei
> >>>return false;
> >>>}
> >>>
>




Re: [RFC 1/2] iova_tree: add an id member to DMAMap

2024-04-19 Thread Eugenio Perez Martin
On Thu, Apr 18, 2024 at 10:46 PM Si-Wei Liu  wrote:
>
>
>
> On 4/10/2024 3:03 AM, Eugenio Pérez wrote:
> > IOVA tree is also used to track the mappings of virtio-net shadow
> > virtqueue.  These mappings may not match the GPA->HVA ones.
> >
> > This causes a problem when overlapped regions (different GPA but same
> > translated HVA) exist in the tree, as looking them up by HVA will return
> > them twice.  To solve this, create an id member so we can assign unique
> > identifiers (GPA) to the maps.
> >
> > Signed-off-by: Eugenio Pérez 
> > ---
> >   include/qemu/iova-tree.h | 5 +++--
> >   util/iova-tree.c | 3 ++-
> >   2 files changed, 5 insertions(+), 3 deletions(-)
> >
> > diff --git a/include/qemu/iova-tree.h b/include/qemu/iova-tree.h
> > index 2a10a7052e..34ee230e7d 100644
> > --- a/include/qemu/iova-tree.h
> > +++ b/include/qemu/iova-tree.h
> > @@ -36,6 +36,7 @@ typedef struct DMAMap {
> >   hwaddr iova;
> >   hwaddr translated_addr;
> >   hwaddr size;/* Inclusive */
> > +uint64_t id;
> >   IOMMUAccessFlags perm;
> >   } QEMU_PACKED DMAMap;
> >   typedef gboolean (*iova_tree_iterator)(DMAMap *map);
> > @@ -100,8 +101,8 @@ const DMAMap *iova_tree_find(const IOVATree *tree, 
> > const DMAMap *map);
> >* @map: the mapping to search
> >*
> >* Search for a mapping in the iova tree that translated_addr overlaps 
> > with the
> > - * mapping range specified.  Only the first found mapping will be
> > - * returned.
> > + * mapping range specified and map->id is equal.  Only the first found
> > + * mapping will be returned.
> >*
> >* Return: DMAMap pointer if found, or NULL if not found.  Note that
> >* the returned DMAMap pointer is maintained internally.  User should
> > diff --git a/util/iova-tree.c b/util/iova-tree.c
> > index 536789797e..0863e0a3b8 100644
> > --- a/util/iova-tree.c
> > +++ b/util/iova-tree.c
> > @@ -97,7 +97,8 @@ static gboolean iova_tree_find_address_iterator(gpointer 
> > key, gpointer value,
> >
> >   needle = args->needle;
> >   if (map->translated_addr + map->size < needle->translated_addr ||
> > -needle->translated_addr + needle->size < map->translated_addr) {
> > +needle->translated_addr + needle->size < map->translated_addr ||
> > +needle->id != map->id) {
>
> It looks like this iterator can also be invoked by SVQ from
> vhost_svq_translate_addr() -> iova_tree_find_iova(), where the guest GPA
> space will be searched without passing in the ID (GPA), and an exact
> match for the same GPA range is not actually needed, unlike in the mapping
> removal case. Could we create an API variant for the SVQ lookup case
> specifically? Or alternatively, add a special flag, say skip_id_match, to
> DMAMap, and the id match check may look like below:
>
> (!needle->skip_id_match && needle->id != map->id)
>
> I think vhost_svq_translate_addr() could just call the API variant or
> pass a DMAMap with skip_id_match set to true to svq_iova_tree_find_iova().
>

I think you're totally right. But I'd really like to not complicate
the API of the iova_tree more.

I think we can look up the region with memory_region_from_host and
then get the hwaddr from it. It is another lookup though...

> Thanks,
> -Siwei
> >   return false;
> >   }
> >
>




Re: Intention to work on GSoC project

2024-04-16 Thread Eugenio Perez Martin
On Mon, Apr 15, 2024 at 9:42 PM Sahil  wrote:
>
> Hi,
>
> Thank you for your reply.
>
> On Monday, April 15, 2024 2:27:36 PM IST Eugenio Perez Martin wrote:
> > [...]
> > > I have one question though. One of the options (use case 1 in [1])
> > >
> > > given to the "qemu-kvm" command is:
> > > > -device virtio-net-pci,netdev=vhost-vdpa0,bus=pcie.0,addr=0x7\
> > > > ,disable-modern=off,page-per-vq=on
> > >
> > > This gives an error:
> > > > Bus "pcie.0" not found
> > >
> > > Does pcie refer to PCI Express? Changing this to pci.0 works.
> >
> > Yes, you don't need to mess with pcie stuff so this solution is
> > totally valid. I think we need to change that part in the tutorial.
> >
>
> Understood.
>
> > > I read through the "device buses" section in QEMU's user
> > > documentation [5], but I have still not understood this.
> > >
> > > "ls /sys/bus/pci/devices/* | grep vdpa" does not give any results.
> > > Replacing pci with pci_express doesn't give any results either. How
> > > does one know which pci bus the vdpa device is connected to?
> > > I have gone through the "vDPA bus drivers" section of the "vDPA
> > > kernel framework" article [6] but I haven't managed to find an
> > > answer yet. Am I missing something here?
> >
> > You cannot see the vDPA device from the guest. From the guest POV it is a
> > regular virtio device over the PCI bus.
> >
> > From the host, vdpa_sim is not a PCI device either, so you cannot see
> > under /sys/bus. Do you have a vdpa* entry under
> > /sys/bus/vdpa/devices/?
> >
>
> After re-reading the linked articles, I think I have got some more
> clarity. One confusion was related to the difference between vdpa
> and vhost-vdpa.
>
> So far what I have understood is that L0 acts as the host and L1
> acts as the guest in this setup. I understand that the guest can't
> see the vDPA device.
>
> I now also understand that vdpa_sim is not a PCI device. I am also
> under the impression that vdpa refers to the vdpa bus while
> vhost-vdpa is the device. Is my understanding correct?
>
> After running the commands in the blog [1], I see that there's a
> vhost-vdpa-0 device under /dev.
>
> I also have an entry "vdpa0" under /sys/bus/vdpa/devices/ which
> is a symlink to /sys/devices/vdpa0. There's a dir "vhost-vdpa-0"
> under "/sys/devices/vdpa0". Hypothetically, if vhost-vdpa-0 had
> been a PCI device, then it would have been present under
> /sys/bus/pci/devices, right?
>

Right. You'll check that scenario with the vp_vdpa one.

> Another source of confusion was the pci.0 option passed to the
> qemu-kvm command. But I have understood this as well now:
> "-device virtio-net-pci" is a pci device.
>
> > > There's one more thing. In "use case 1" of "Running traffic with
> > > vhost_vdpa in Guest" [1], running "modprobe pktgen" in the L1 VM
> > >
> > > gives an error:
> > > > module pktgen couldn't be found in /lib/modules/6.5.6-300.fc39.x86_64.
> > >
> > > The kernel version is 6.5.6-300.fc39.x86_64. I haven't tried building
> > > pktgen manually in L1. I'll try that and will check if vdpa_sim works
> > > as expected after that.
> >
> > Did you install kernel-modules-internal?
>
> I just realized I had the wrong version of kernel-modules-internal
> installed. It works after installing the right version.
>

Good! So you can move to vp_vdpa, or do you have more doubts about vdpa_sim?

Thanks!




Re: Discrepancy between mmap call on DPDK/libvduse and rust vm-memory crate

2024-04-15 Thread Eugenio Perez Martin
On Sun, Apr 14, 2024 at 11:02 AM Michael S. Tsirkin  wrote:
>
> On Fri, Apr 12, 2024 at 12:15:40PM +0200, Eugenio Perez Martin wrote:
> > Hi!
> >
> > I'm building a bridge to expose vhost-user devices through VDUSE. The
> > code is still immature but I'm able to forward packets using
> > dpdk-l2fwd through VDUSE to VM. I'm now developing exposing virtiofsd,
> > but I've hit an error I'd like to discuss.
> >
> > VDUSE devices can get all the memory regions the driver is using by
> > VDUSE_IOTLB_GET_FD ioctl. It returns a file descriptor with a memory
> > region associated that can be mapped with mmap, and an information
> > entry about the map it contains:
> > * Start and end addresses from the driver POV
> > * Offset within the mmaped region of these start and end
> > * Device permissions over that region.
> >
> > [start=0xc3000][last=0xe7fff][offset=0xc3000][perm=1]
> >
> > Now when I try to map it, it is impossible for the userspace device to
> > call mmap with any offset different than 0.
>
> How exactly did you allocate memory? hugetlbfs?
>

Yes, that was definitely the cause, thank you very much!

> > So the "straightforward"
> > mmap with size = entry.last-entry.start and offset = entry.offset does
> > not work. I don't know if this is a limitation of Linux or VDUSE.
> >
> > Checking QEMU's
> > subprojects/libvduse/libvduse.c:vduse_iova_add_region() I see it
> > handles the offset by adding it up to the size, instead of using it
> > directly as a parameter in the mmap:
> >
> > void *mmap_addr = mmap(0, size + offset, prot, MAP_SHARED, fd, 0);
>
>
> CC Xie Yongji who wrote this code, too.
>

Thanks!

>
> > I can replicate it on the bridge for sure.
> >
> > Now I send the VhostUserMemoryRegion to the vhost-user application.
> > The struct has these members:
> > struct VhostUserMemoryRegion {
> > uint64_t guest_phys_addr;
> > uint64_t memory_size;
> > uint64_t userspace_addr;
> > uint64_t mmap_offset;
> > };
> >
> > So I can send the offset to the vhost-user device. I can check that
> > dpdk-l2fwd uses the same trick of adding offset to the size of the
> > mapping region [1], at
> > lib/vhost/vhost_user.c:vhost_user_mmap_region():
> >
> > mmap_size = region->size + mmap_offset;
> > mmap_addr = mmap(NULL, mmap_size, PROT_READ | PROT_WRITE,
> > MAP_SHARED | populate, region->fd, 0);
> >
> > So mmap is called with offset == 0 and everybody is happy.
> >
> > Now I'm moving to virtiofsd, and vm-memory crate in particular. And it
> > performs the mmap without the size += offset trick, at
> > MmapRegionBuilder:build() [2].
> >
> > I can try to apply the offset + size trick in my bridge but I don't
> > think it is the right solution. At first glance, the right solution is
> > to mmap with the offset as vm-memory crate do. But having libvduse and
> > DPDK apply the same trick sounds to me like it is a known limitation /
> > workaround I don't know about. What is the history of this? Can VDUSE
> > problem (if any) be solved? Am I missing something?
> >
> > Thanks!
> >
> > [1] 
> > https://github.com/DPDK/dpdk/blob/e2e546ab5bf5e024986ccb5310ab43982f3bb40c/lib/vhost/vhost_user.c#L1305
> > [2] https://github.com/rust-vmm/vm-memory/blob/main/src/mmap_unix.rs#L128
>




Re: Intention to work on GSoC project

2024-04-15 Thread Eugenio Perez Martin
On Sun, Apr 14, 2024 at 8:52 PM Sahil  wrote:
>
> Hi,
>
> On Friday, April 5, 2024 12:36:02 AM IST Sahil wrote:
> > [...]
> > I'll set up this environment as well.
>
> I would like to post an update here. I spent the last week
> trying to set up the environment as described in the blog [1].
> I initially tried to get the L1 VM running on my host machine
> (Arch Linux). However, I was unable to use virt-sysprep or
> virt-customize to install packages in the qcow2 image. It wasn't
> able to resolve hostnames while downloading the packages.
>
> According to the logs, /etc/resolv.conf was a dangling symlink.
> I tried to use "virt-rescue" to configure DNS resolution. I tried
> following these sections [2], [3] in the Arch wiki but that didn't
> work either. I tried using qemu-nbd as well following this section
> [4] to access the image. While I managed to gain access to the
> image, I wasn't able to install packages after performing a
> chroot.
>
> One workaround was to set this environment up in a VM. I
> decided to set up the environment with a Fedora image in
> VirtualBox acting as L0. I have managed to set up an L1 VM
> in this environment and I can load it using qemu-kvm.
>

I'm not clear whether the dangling-symlink complaint comes from the
host or from the guest env, but I think it is OK to continue if you've
been able to build the env.

> I have one question though. One of the options (use case 1 in [1])
> given to the "qemu-kvm" command is:
> > -device virtio-net-pci,netdev=vhost-vdpa0,bus=pcie.0,addr=0x7\
> > ,disable-modern=off,page-per-vq=on
>
> This gives an error:
> > Bus "pcie.0" not found
>
> Does pcie refer to PCI Express? Changing this to pci.0 works.

Yes, you don't need to mess with pcie stuff so this solution is
totally valid. I think we need to change that part in the tutorial.

> I read through the "device buses" section in QEMU's user
> documentation [5], but I have still not understood this.
>
> "ls /sys/bus/pci/devices/* | grep vdpa" does not give any results.
> Replacing pci with pci_express doesn't give any results either. How
> does one know which pci bus the vdpa device is connected to?
> I have gone through the "vDPA bus drivers" section of the "vDPA
> kernel framework" article [6] but I haven't managed to find an
> answer yet. Am I missing something here?
>

You cannot see the vDPA device from the guest. From the guest POV it is a
regular virtio device over the PCI bus.

From the host, vdpa_sim is not a PCI device either, so you cannot see it
under /sys/bus. Do you have a vdpa* entry under
/sys/bus/vdpa/devices/?

> There's one more thing. In "use case 1" of "Running traffic with
> vhost_vdpa in Guest" [1], running "modprobe pktgen" in the L1 VM
> gives an error:
> > module pktgen couldn't be found in /lib/modules/6.5.6-300.fc39.x86_64.
>
> The kernel version is 6.5.6-300.fc39.x86_64. I haven't tried building
> pktgen manually in L1. I'll try that and will check if vdpa_sim works
> as expected after that.
>

Did you install kernel-modules-internal?

Thanks!

> [1] 
> https://www.redhat.com/en/blog/hands-vdpa-what-do-you-do-when-you-aint-got-hardware-part-1
> [2] https://wiki.archlinux.org/title/QEMU#User-mode_networking
> [3] 
> https://wiki.archlinux.org/title/Systemd-networkd#Required_services_and_setup
> [4] 
> https://wiki.archlinux.org/title/QEMU#Mounting_a_partition_from_a_qcow2_image
> [5] https://qemu-project.gitlab.io/qemu/system/device-emulation.html
> [6] 
> https://www.redhat.com/en/blog/vdpa-kernel-framework-part-1-vdpa-bus-abstracting-hardware
>
> Thanks,
> Sahil
>
>




Discrepancy between mmap call on DPDK/libvduse and rust vm-memory crate

2024-04-12 Thread Eugenio Perez Martin
Hi!

I'm building a bridge to expose vhost-user devices through VDUSE. The
code is still immature, but I'm able to forward packets with
dpdk-l2fwd through VDUSE to a VM. I'm now working on exposing virtiofsd,
but I've hit an error I'd like to discuss.

VDUSE devices can get all the memory regions the driver is using via the
VDUSE_IOTLB_GET_FD ioctl. It returns a file descriptor with an associated
memory region that can be mapped with mmap, plus an information
entry describing the map it contains:
* Start and end addresses from the driver POV
* Offset of these start and end addresses within the mmapped region
* Device permissions over that region.

[start=0xc3000][last=0xe7fff][offset=0xc3000][perm=1]

Now when I try to map it, it is impossible for the userspace device to
call mmap with any offset different from 0. So the "straightforward"
mmap with size = entry.last - entry.start and offset = entry.offset does
not work. I don't know if this is a limitation of Linux or VDUSE.

Checking QEMU's
subprojects/libvduse/libvduse.c:vduse_iova_add_region(), I see it
handles the offset by adding it to the size, instead of passing it
directly as the mmap offset parameter:

void *mmap_addr = mmap(0, size + offset, prot, MAP_SHARED, fd, 0);

I can replicate it on the bridge for sure.

Now I send the VhostUserMemoryRegion to the vhost-user application.
The struct has these members:
struct VhostUserMemoryRegion {
uint64_t guest_phys_addr;
uint64_t memory_size;
uint64_t userspace_addr;
uint64_t mmap_offset;
};

So I can send the offset to the vhost-user device. I can check that
dpdk-l2fwd uses the same trick of adding offset to the size of the
mapping region [1], at
lib/vhost/vhost_user.c:vhost_user_mmap_region():

mmap_size = region->size + mmap_offset;
mmap_addr = mmap(NULL, mmap_size, PROT_READ | PROT_WRITE,
MAP_SHARED | populate, region->fd, 0);

So mmap is called with offset == 0 and everybody is happy.

Now I'm moving to virtiofsd, and the vm-memory crate in particular, which
performs the mmap without the size += offset trick, at
MmapRegionBuilder::build() [2].

I can try to apply the offset + size trick in my bridge, but I don't
think it is the right solution. At first glance, the right solution is
to mmap with the offset, as the vm-memory crate does. But having both
libvduse and DPDK apply the same trick suggests a known limitation /
workaround I don't know about. What is the history of this? Can the VDUSE
problem (if any) be solved? Am I missing something?

Thanks!

[1] 
https://github.com/DPDK/dpdk/blob/e2e546ab5bf5e024986ccb5310ab43982f3bb40c/lib/vhost/vhost_user.c#L1305
[2] https://github.com/rust-vmm/vm-memory/blob/main/src/mmap_unix.rs#L128




Re: [RFC 0/2] Identify aliased maps in vdpa SVQ iova_tree

2024-04-12 Thread Eugenio Perez Martin
On Fri, Apr 12, 2024 at 8:47 AM Jason Wang  wrote:
>
> On Wed, Apr 10, 2024 at 6:03 PM Eugenio Pérez  wrote:
> >
> > The guest may have overlapped memory regions, where different GPAs lead
> > to the same HVA.  This causes a problem when overlapped regions
> > (different GPA but same translated HVA) exist in the tree, as looking
> > them up by HVA will return them twice.
>
> I think I don't understand if there's any side effect for shadow virtqueue?
>

My bad, I totally forgot to put a reference to where this comes from.

Si-Wei found that during initialization this sequence of maps /
unmaps happens [1]:

HVA                           GPA                   IOVA
--------------------------------------------------------------------------
Map
[0x7f7903e0, 0x7f7983e0)      [0x0, 0x8000)         [0x1000, 0x8000)
[0x7f7983e0, 0x7f9903e0)      [0x1, 0x208000)       [0x80001000, 0x201000)
[0x7f7903ea, 0x7f7903ec)      [0xfeda, 0xfedc)      [0x201000, 0x221000)

Unmap
[0x7f7903ea, 0x7f7903ec)      [0xfeda, 0xfedc)      [0x1000, 0x2) ???

The third HVA range is contained in the first one, but exposed under a
different GPA (aliased). This is not "flattened" by QEMU, as the GPAs do
not overlap, only the HVAs.

At the third chunk unmap, the current algorithm finds the first chunk,
not the second one. This series is the way to tell the difference at
unmap time.

[1] https://lists.nongnu.org/archive/html/qemu-devel/2024-04/msg00079.html

Thanks!

> Thanks
>
> >
> > To solve this, track the GPA in the DMA entry so it acts as a unique
> > identifier for the maps.  When a map needs to be removed, the iova tree
> > is able to find the right one.
> >
> > Users that do not need this extra layer of indirection can use the
> > iova tree as usual, with id = 0.
> >
> > This was found by Si-Wei Liu , but I'm having a hard
> > time reproducing the issue.  This has been tested only without overlapping
> > maps.  If it works with overlapping maps, it will be integrated in the main
> > series.
> >
> > Comments are welcome.  Thanks!
> >
> > Eugenio Pérez (2):
> >   iova_tree: add an id member to DMAMap
> >   vdpa: identify aliased maps in iova_tree
> >
> >  hw/virtio/vhost-vdpa.c   | 2 ++
> >  include/qemu/iova-tree.h | 5 +++--
> >  util/iova-tree.c | 3 ++-
> >  3 files changed, 7 insertions(+), 3 deletions(-)
> >
> > --
> > 2.44.0
> >
>




Re: [PULL 4/7] hw/virtio: Fix packed virtqueue flush used_idx

2024-04-09 Thread Eugenio Perez Martin
On Tue, Apr 9, 2024 at 7:40 PM Michael Tokarev  wrote:
>
> 09.04.2024 10:32, Michael S. Tsirkin wrote:
> > From: Wafer 
> >
> > In the event of writing many chains of descriptors, the device must
> > write just the id of the last buffer in the descriptor chain, skip
> > forward the number of descriptors in the chain, and then repeat the
> > operations for the rest of chains.
> >
> > Current QEMU code writes all the buffer ids consecutively, and then
> > skips all the buffers altogether. This is a bug, and can be reproduced
> > with a VirtIONet device with _F_MRG_RXBUF and without
> > _F_INDIRECT_DESC:
> >
> > If a virtio-net device has the VIRTIO_NET_F_MRG_RXBUF feature
> > but not the VIRTIO_RING_F_INDIRECT_DESC feature,
> > 'VirtIONetQueue->rx_vq' will use the merge feature
> > to store data in multiple 'elems'.
> > The 'num_buffers' in the virtio header indicates how many elements are 
> > merged.
> > If the value of 'num_buffers' is greater than 1,
> > all the merged elements will be filled into the descriptor ring.
> > The 'idx' of the elements should be the value of 'vq->used_idx' plus 
> > 'ndescs'.
> >
> > Fixes: 86044b24e8 ("virtio: basic packed virtqueue support")
> > Acked-by: Eugenio Pérez 
> > Signed-off-by: Wafer 
> > Message-Id: <20240407015451.5228-2-wa...@jaguarmicro.com>
> > Reviewed-by: Michael S. Tsirkin 
> > Signed-off-by: Michael S. Tsirkin 
> > ---
> >   hw/virtio/virtio.c | 12 ++--
> >   1 file changed, 10 insertions(+), 2 deletions(-)
>
> Is this -stable material?
>

Hi Michael,

Yes it is. It should be easy to backport but let me know if you need any help.

Thanks!

> Thanks,
>
> /mjt
>
> > diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
> > index d229755eae..c5bedca848 100644
> > --- a/hw/virtio/virtio.c
> > +++ b/hw/virtio/virtio.c
> > @@ -957,12 +957,20 @@ static void virtqueue_packed_flush(VirtQueue *vq, 
> > unsigned int count)
> >   return;
> >   }
> >
> > +/*
> > + * For indirect element's 'ndescs' is 1.
> > + * For all other elemment's 'ndescs' is the
> > + * number of descriptors chained by NEXT (as set in 
> > virtqueue_packed_pop).
> > + * So When the 'elem' be filled into the descriptor ring,
> > + * The 'idx' of this 'elem' shall be
> > + * the value of 'vq->used_idx' plus the 'ndescs'.
> > + */
> > +ndescs += vq->used_elems[0].ndescs;
> >   for (i = 1; i < count; i++) {
> > -virtqueue_packed_fill_desc(vq, &vq->used_elems[i], i, false);
> > +virtqueue_packed_fill_desc(vq, &vq->used_elems[i], ndescs, false);
> >   ndescs += vq->used_elems[i].ndescs;
> >   }
> >   virtqueue_packed_fill_desc(vq, &vq->used_elems[0], 0, true);
> > -ndescs += vq->used_elems[0].ndescs;
> >
> >   vq->inuse -= ndescs;
> >   vq->used_idx += ndescs;
>




Re: [PATCH v4] hw/virtio: Fix packed virtqueue flush used_idx

2024-04-08 Thread Eugenio Perez Martin
On Sun, Apr 7, 2024 at 3:56 AM Wafer  wrote:
>

Let me suggest a more generic description for the patch:

In the event of writing many chains of descriptors, the device must
write just the id of the last buffer in the descriptor chain, skip
forward the number of descriptors in the chain, and then repeat the
operations for the rest of chains.

Current QEMU code writes all the buffer ids consecutively, and then
skips all the buffers altogether. This is a bug, and can be reproduced
with a VirtIONet device with _F_MRG_RXBUF and without
_F_INDIRECT_DESC...
---

And then your description, particularly for VirtIONet, is totally
fine. Feel free to make changes to the description or suggest a better
wording.

Thanks!

> If a virtio-net device has the VIRTIO_NET_F_MRG_RXBUF feature
> but not the VIRTIO_RING_F_INDIRECT_DESC feature,
> 'VirtIONetQueue->rx_vq' will use the merge feature
> to store data in multiple 'elems'.
> The 'num_buffers' in the virtio header indicates how many elements are merged.
> If the value of 'num_buffers' is greater than 1,
> all the merged elements will be filled into the descriptor ring.
> The 'idx' of the elements should be the value of 'vq->used_idx' plus 'ndescs'.
>
> Fixes: 86044b24e8 ("virtio: basic packed virtqueue support")
> Acked-by: Eugenio Pérez 
> Signed-off-by: Wafer 
>
> ---
> Changes in v4:
>   - Add Acked-by.
>
> Changes in v3:
>   - Add the commit-ID of the introduced problem in commit message.
>
> Changes in v2:
>   - Clarify more in commit message.
> ---
>  hw/virtio/virtio.c | 12 ++--
>  1 file changed, 10 insertions(+), 2 deletions(-)
>
> diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
> index fb6b4ccd83..cab5832cac 100644
> --- a/hw/virtio/virtio.c
> +++ b/hw/virtio/virtio.c
> @@ -957,12 +957,20 @@ static void virtqueue_packed_flush(VirtQueue *vq, 
> unsigned int count)
>  return;
>  }
>
> +/*
> + * For indirect element's 'ndescs' is 1.
> + * For all other elemment's 'ndescs' is the
> + * number of descriptors chained by NEXT (as set in 
> virtqueue_packed_pop).
> + * So When the 'elem' be filled into the descriptor ring,
> + * The 'idx' of this 'elem' shall be
> + * the value of 'vq->used_idx' plus the 'ndescs'.
> + */
> +ndescs += vq->used_elems[0].ndescs;
>  for (i = 1; i < count; i++) {
> -virtqueue_packed_fill_desc(vq, &vq->used_elems[i], i, false);
> +virtqueue_packed_fill_desc(vq, &vq->used_elems[i], ndescs, false);
>  ndescs += vq->used_elems[i].ndescs;
>  }
>  virtqueue_packed_fill_desc(vq, &vq->used_elems[0], 0, true);
> -ndescs += vq->used_elems[0].ndescs;
>
>  vq->inuse -= ndescs;
>  vq->used_idx += ndescs;
> --
> 2.27.0
>




Re: [PATCH v3] hw/virtio: Fix packed virtqueue flush used_idx

2024-04-05 Thread Eugenio Perez Martin
On Fri, Apr 5, 2024 at 3:20 PM Wafer  wrote:
>
> If a virtio-net device has the VIRTIO_NET_F_MRG_RXBUF feature
> but not the VIRTIO_RING_F_INDIRECT_DESC feature,
> 'VirtIONetQueue->rx_vq' will use the merge feature
> to store data in multiple 'elems'.
> The 'num_buffers' in the virtio header indicates how many elements are merged.
> If the value of 'num_buffers' is greater than 1,
> all the merged elements will be filled into the descriptor ring.
> The 'idx' of the elements should be the value of 'vq->used_idx' plus 'ndescs'.
>
> Fixes: 86044b24e8 ("virtio: basic packed virtqueue support")

Acked-by: Eugenio Pérez 

> Signed-off-by: Wafer 
>
> ---
> Changes in v3:
>   - Add the commit-ID of the introduced problem in commit message;
>
> Changes in v2:
>   - Clarify more in commit message;
> ---
>  hw/virtio/virtio.c | 12 ++--
>  1 file changed, 10 insertions(+), 2 deletions(-)
>
> diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
> index fb6b4ccd83..cab5832cac 100644
> --- a/hw/virtio/virtio.c
> +++ b/hw/virtio/virtio.c
> @@ -957,12 +957,20 @@ static void virtqueue_packed_flush(VirtQueue *vq, 
> unsigned int count)
>  return;
>  }
>
> +/*
> + * For indirect element's 'ndescs' is 1.
> + * For all other elemment's 'ndescs' is the
> + * number of descriptors chained by NEXT (as set in 
> virtqueue_packed_pop).
> + * So When the 'elem' be filled into the descriptor ring,
> + * The 'idx' of this 'elem' shall be
> + * the value of 'vq->used_idx' plus the 'ndescs'.
> + */
> +ndescs += vq->used_elems[0].ndescs;
>  for (i = 1; i < count; i++) {
> -virtqueue_packed_fill_desc(vq, &vq->used_elems[i], i, false);
> +virtqueue_packed_fill_desc(vq, &vq->used_elems[i], ndescs, false);
>  ndescs += vq->used_elems[i].ndescs;
>  }
>  virtqueue_packed_fill_desc(vq, &vq->used_elems[0], 0, true);
> -ndescs += vq->used_elems[0].ndescs;
>
>  vq->inuse -= ndescs;
>  vq->used_idx += ndescs;
> --
> 2.27.0
>




Re: [RFC v2 1/5] virtio: Initialize sequence variables

2024-04-05 Thread Eugenio Perez Martin
On Fri, Apr 5, 2024 at 3:59 PM Jonah Palmer  wrote:
>
>
>
> On 4/4/24 12:33 PM, Eugenio Perez Martin wrote:
> > On Thu, Apr 4, 2024 at 4:42 PM Jonah Palmer  wrote:
> >>
> >>
> >>
> >> On 4/4/24 7:35 AM, Eugenio Perez Martin wrote:
> >>> On Wed, Apr 3, 2024 at 6:51 PM Jonah Palmer  
> >>> wrote:
> >>>>
> >>>>
> >>>>
> >>>> On 4/3/24 6:18 AM, Eugenio Perez Martin wrote:
> >>>>> On Thu, Mar 28, 2024 at 5:22 PM Jonah Palmer  
> >>>>> wrote:
> >>>>>>
> >>>>>> Initialize sequence variables for VirtQueue and VirtQueueElement
> >>>>>> structures. A VirtQueue's sequence variables are initialized when a
> >>>>>> VirtQueue is being created or reset. A VirtQueueElement's sequence
> >>>>>> variable is initialized when a VirtQueueElement is being initialized.
> >>>>>> These variables will be used to support the VIRTIO_F_IN_ORDER feature.
> >>>>>>
> >>>>>> A VirtQueue's used_seq_idx represents the next expected index in a
> >>>>>> sequence of VirtQueueElements to be processed (put on the used ring).
> >>>>>> The next VirtQueueElement added to the used ring must match this
> >>>>>> sequence number before additional elements can be safely added to the
> >>>>>> used ring. It's also particularly useful for helping find the number of
> >>>>>> new elements added to the used ring.
> >>>>>>
> >>>>>> A VirtQueue's current_seq_idx represents the current sequence index.
> >>>>>> This value is essentially a counter where the value is assigned to a 
> >>>>>> new
> >>>>>> VirtQueueElement and then incremented. Given its uint16_t type, this
> >>>>>> sequence number can be between 0 and 65,535.
> >>>>>>
> >>>>>> A VirtQueueElement's seq_idx represents the sequence number assigned to
> >>>>>> the VirtQueueElement when it was created. This value must match with 
> >>>>>> the
> >>>>>> VirtQueue's used_seq_idx before the element can be put on the used ring
> >>>>>> by the device.
> >>>>>>
> >>>>>> Signed-off-by: Jonah Palmer 
> >>>>>> ---
> >>>>>> hw/virtio/virtio.c | 18 ++
> >>>>>> include/hw/virtio/virtio.h |  1 +
> >>>>>> 2 files changed, 19 insertions(+)
> >>>>>>
> >>>>>> diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
> >>>>>> index fb6b4ccd83..069d96df99 100644
> >>>>>> --- a/hw/virtio/virtio.c
> >>>>>> +++ b/hw/virtio/virtio.c
> >>>>>> @@ -132,6 +132,10 @@ struct VirtQueue
> >>>>>> uint16_t used_idx;
> >>>>>> bool used_wrap_counter;
> >>>>>>
> >>>>>> +/* In-Order sequence indices */
> >>>>>> +uint16_t used_seq_idx;
> >>>>>> +uint16_t current_seq_idx;
> >>>>>> +
> >>>>>
> >>>>> I'm having a hard time understanding the difference between these and
> >>>>> last_avail_idx and used_idx. It seems to me if we replace them
> >>>>> everything will work? What am I missing?
> >>>>>
> >>>>
> >>>> For used_seq_idx, it does work like used_idx except the difference is
> >>>> when their values get updated, specifically for the split VQ case.
> >>>>
> >>>> As you know, for the split VQ case, the used_idx is updated during
> >>>> virtqueue_split_flush. However, imagine a batch of elements coming in
> >>>> where virtqueue_split_fill is called multiple times before
> >>>> virtqueue_split_flush. We want to make sure we write these elements to
> >>>> the used ring in-order and we'll know its order based on used_seq_idx.
> >>>>
> >>>> Alternatively, I thought about replicating the logic for the packed VQ
> >>>> case (where this used_seq_idx isn't used) where we start looking at
> >>>> vq->used_elems[vq->used_idx] and iterate through until we find a used
> >>>

Re: [PATCH v2] hw/virtio: Fix packed virtqueue flush used_idx

2024-04-05 Thread Eugenio Perez Martin
On Thu, Apr 4, 2024 at 7:03 PM Wafer  wrote:
>
> If a virtio-net device has the VIRTIO_NET_F_MRG_RXBUF feature
> but not the VIRTIO_RING_F_INDIRECT_DESC feature,
> 'VirtIONetQueue->rx_vq' will use the merge feature
> to store data in multiple 'elems'.
> The 'num_buffers' in the virtio header indicates how many elements are merged.
> If the value of 'num_buffers' is greater than 1,
> all the merged elements will be filled into the descriptor ring.
> The 'idx' of the elements should be the value of 'vq->used_idx' plus 'ndescs'.
>
> Signed-off-by: Wafer 
>

Fixes: 86044b24e8 ("virtio: basic packed virtqueue support")
?

> ---
> Changes in v2:
>   - Clarify more in commit message;
> ---
>  hw/virtio/virtio.c | 12 ++--
>  1 file changed, 10 insertions(+), 2 deletions(-)
>
> diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
> index fb6b4ccd83..cab5832cac 100644
> --- a/hw/virtio/virtio.c
> +++ b/hw/virtio/virtio.c
> @@ -957,12 +957,20 @@ static void virtqueue_packed_flush(VirtQueue *vq, 
> unsigned int count)
>  return;
>  }
>
> +/*
> + * For an indirect element, 'ndescs' is 1.
> + * For all other elements, 'ndescs' is the
> + * number of descriptors chained by NEXT (as set in
> + * virtqueue_packed_pop).
> + * So when the 'elem' is filled into the descriptor ring,
> + * the 'idx' of this 'elem' shall be
> + * the value of 'vq->used_idx' plus the 'ndescs'.
> + */
> +ndescs += vq->used_elems[0].ndescs;
>  for (i = 1; i < count; i++) {
> -virtqueue_packed_fill_desc(vq, &vq->used_elems[i], i, false);
> +virtqueue_packed_fill_desc(vq, &vq->used_elems[i], ndescs, false);

This bugged me recently when I was reviewing it for in_order feature
implementation, thanks for the patch!

Acked-by: Eugenio Pérez 

>  ndescs += vq->used_elems[i].ndescs;
>  }
>  virtqueue_packed_fill_desc(vq, &vq->used_elems[0], 0, true);
> -ndescs += vq->used_elems[0].ndescs;
>
>  vq->inuse -= ndescs;
>  vq->used_idx += ndescs;
> --
> 2.27.0
>
>




Re: [RFC v2 1/5] virtio: Initialize sequence variables

2024-04-04 Thread Eugenio Perez Martin
On Thu, Apr 4, 2024 at 4:42 PM Jonah Palmer  wrote:
>
>
>
> On 4/4/24 7:35 AM, Eugenio Perez Martin wrote:
> > On Wed, Apr 3, 2024 at 6:51 PM Jonah Palmer  wrote:
> >>
> >>
> >>
> >> On 4/3/24 6:18 AM, Eugenio Perez Martin wrote:
> >>> On Thu, Mar 28, 2024 at 5:22 PM Jonah Palmer  
> >>> wrote:
> >>>>
> >>>> Initialize sequence variables for VirtQueue and VirtQueueElement
> >>>> structures. A VirtQueue's sequence variables are initialized when a
> >>>> VirtQueue is being created or reset. A VirtQueueElement's sequence
> >>>> variable is initialized when a VirtQueueElement is being initialized.
> >>>> These variables will be used to support the VIRTIO_F_IN_ORDER feature.
> >>>>
> >>>> A VirtQueue's used_seq_idx represents the next expected index in a
> >>>> sequence of VirtQueueElements to be processed (put on the used ring).
> >>>> The next VirtQueueElement added to the used ring must match this
> >>>> sequence number before additional elements can be safely added to the
> >>>> used ring. It's also particularly useful for helping find the number of
> >>>> new elements added to the used ring.
> >>>>
> >>>> A VirtQueue's current_seq_idx represents the current sequence index.
> >>>> This value is essentially a counter where the value is assigned to a new
> >>>> VirtQueueElement and then incremented. Given its uint16_t type, this
> >>>> sequence number can be between 0 and 65,535.
> >>>>
> >>>> A VirtQueueElement's seq_idx represents the sequence number assigned to
> >>>> the VirtQueueElement when it was created. This value must match with the
> >>>> VirtQueue's used_seq_idx before the element can be put on the used ring
> >>>> by the device.
> >>>>
> >>>> Signed-off-by: Jonah Palmer 
> >>>> ---
> >>>>hw/virtio/virtio.c | 18 ++
> >>>>include/hw/virtio/virtio.h |  1 +
> >>>>2 files changed, 19 insertions(+)
> >>>>
> >>>> diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
> >>>> index fb6b4ccd83..069d96df99 100644
> >>>> --- a/hw/virtio/virtio.c
> >>>> +++ b/hw/virtio/virtio.c
> >>>> @@ -132,6 +132,10 @@ struct VirtQueue
> >>>>uint16_t used_idx;
> >>>>bool used_wrap_counter;
> >>>>
> >>>> +/* In-Order sequence indices */
> >>>> +uint16_t used_seq_idx;
> >>>> +uint16_t current_seq_idx;
> >>>> +
> >>>
> >>> I'm having a hard time understanding the difference between these and
> >>> last_avail_idx and used_idx. It seems to me if we replace them
> >>> everything will work? What am I missing?
> >>>
> >>
> >> For used_seq_idx, it does work like used_idx except the difference is
> >> when their values get updated, specifically for the split VQ case.
> >>
> >> As you know, for the split VQ case, the used_idx is updated during
> >> virtqueue_split_flush. However, imagine a batch of elements coming in
> >> where virtqueue_split_fill is called multiple times before
> >> virtqueue_split_flush. We want to make sure we write these elements to
> >> the used ring in-order and we'll know its order based on used_seq_idx.
> >>
> >> Alternatively, I thought about replicating the logic for the packed VQ
> >> case (where this used_seq_idx isn't used) where we start looking at
> >> vq->used_elems[vq->used_idx] and iterate through until we find a used
> >> element, but I wasn't sure how to handle the case where elements get
> >> used (written to the used ring) and new elements get put in used_elems
> >> before the used_idx is updated. Since this search would require us to
> >> always start at index vq->used_idx.
> >>
> >> For example, say, of three elements getting filled (elem0 - elem2),
> >> elem1 and elem0 come back first (vq->used_idx = 0):
> >>
> >> elem1 - not in-order
> >> elem0 - in-order, vq->used_elems[vq->used_idx + 1] (elem1) also now
> >>   in-order, write elem0 and elem1 to used ring, mark elements as
> >>   used
> >>
> >> Then elem2 comes back, but vq->used_idx is still 0,

Re: [RFC v2 1/5] virtio: Initialize sequence variables

2024-04-04 Thread Eugenio Perez Martin
On Wed, Apr 3, 2024 at 6:51 PM Jonah Palmer  wrote:
>
>
>
> On 4/3/24 6:18 AM, Eugenio Perez Martin wrote:
> > On Thu, Mar 28, 2024 at 5:22 PM Jonah Palmer  
> > wrote:
> >>
> >> Initialize sequence variables for VirtQueue and VirtQueueElement
> >> structures. A VirtQueue's sequence variables are initialized when a
> >> VirtQueue is being created or reset. A VirtQueueElement's sequence
> >> variable is initialized when a VirtQueueElement is being initialized.
> >> These variables will be used to support the VIRTIO_F_IN_ORDER feature.
> >>
> >> A VirtQueue's used_seq_idx represents the next expected index in a
> >> sequence of VirtQueueElements to be processed (put on the used ring).
> >> The next VirtQueueElement added to the used ring must match this
> >> sequence number before additional elements can be safely added to the
> >> used ring. It's also particularly useful for helping find the number of
> >> new elements added to the used ring.
> >>
> >> A VirtQueue's current_seq_idx represents the current sequence index.
> >> This value is essentially a counter where the value is assigned to a new
> >> VirtQueueElement and then incremented. Given its uint16_t type, this
> >> sequence number can be between 0 and 65,535.
> >>
> >> A VirtQueueElement's seq_idx represents the sequence number assigned to
> >> the VirtQueueElement when it was created. This value must match with the
> >> VirtQueue's used_seq_idx before the element can be put on the used ring
> >> by the device.
> >>
> >> Signed-off-by: Jonah Palmer 
> >> ---
> >>   hw/virtio/virtio.c | 18 ++
> >>   include/hw/virtio/virtio.h |  1 +
> >>   2 files changed, 19 insertions(+)
> >>
> >> diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
> >> index fb6b4ccd83..069d96df99 100644
> >> --- a/hw/virtio/virtio.c
> >> +++ b/hw/virtio/virtio.c
> >> @@ -132,6 +132,10 @@ struct VirtQueue
> >>   uint16_t used_idx;
> >>   bool used_wrap_counter;
> >>
> >> +/* In-Order sequence indices */
> >> +uint16_t used_seq_idx;
> >> +uint16_t current_seq_idx;
> >> +
> >
> > I'm having a hard time understanding the difference between these and
> > last_avail_idx and used_idx. It seems to me if we replace them
> > everything will work? What am I missing?
> >
>
> For used_seq_idx, it does work like used_idx except the difference is
> when their values get updated, specifically for the split VQ case.
>
> As you know, for the split VQ case, the used_idx is updated during
> virtqueue_split_flush. However, imagine a batch of elements coming in
> where virtqueue_split_fill is called multiple times before
> virtqueue_split_flush. We want to make sure we write these elements to
> the used ring in-order and we'll know its order based on used_seq_idx.
>
> Alternatively, I thought about replicating the logic for the packed VQ
> case (where this used_seq_idx isn't used) where we start looking at
> vq->used_elems[vq->used_idx] and iterate through until we find a used
> element, but I wasn't sure how to handle the case where elements get
> used (written to the used ring) and new elements get put in used_elems
> before the used_idx is updated. Since this search would require us to
> always start at index vq->used_idx.
>
> For example, say, of three elements getting filled (elem0 - elem2),
> elem1 and elem0 come back first (vq->used_idx = 0):
>
> elem1 - not in-order
> elem0 - in-order, vq->used_elems[vq->used_idx + 1] (elem1) also now
>  in-order, write elem0 and elem1 to used ring, mark elements as
>  used
>
> Then elem2 comes back, but vq->used_idx is still 0, so how do we know to
> ignore the used elements at vq->used_idx (elem0) and vq->used_idx + 1
> (elem1) and iterate to vq->used_idx + 2 (elem2)?
>
> Hmm... now that I'm thinking about it, maybe for the split VQ case we
> could continue looking through the vq->used_elems array until we find an
> unused element... but then again how would we (1) know if the element is
> in-order and (2) know when to stop searching?
>

Ok I think I understand the problem now. It is aggravated if we add
chained descriptors to the mix.

We know that the order of used descriptors must be the exact same as
the order they were made available, leaving out in order batching.
What if vq->used_elems at virtqueue_pop and then virtqueue_push just
marks them as used somehow? Two boolea

Re: Intention to work on GSoC project

2024-04-03 Thread Eugenio Perez Martin
On Wed, Apr 3, 2024 at 4:36 PM Sahil  wrote:
>
> Hi,
>
> Thank you for the reply.
>
> On Tuesday, April 2, 2024 5:08:24 PM IST Eugenio Perez Martin wrote:
> > [...]
> > > > > Q2.
> > > > > In the Red Hat article, just below the first listing ("Memory layout 
> > > > > of a
> > > > > packed virtqueue descriptor"), there's the following line referring 
> > > > > to the
> > > > > buffer id in "virtq_desc":
> > > > > > This time, the id field is not an index for the device to look for 
> > > > > > the
> > > > > > buffer: it is an opaque value for it, only has meaning for the 
> > > > > > driver.
> > > > >
> > > > > But the device returns the buffer id when it writes the used 
> > > > > descriptor to
> > > > > the descriptor ring. The "only has meaning for the driver" part has 
> > > > > got me
> > > > > a little confused. Which buffer id is this that the device returns? 
> > > > > Is it related
> > > > > to the buffer id in the available descriptor?
> > > >
> > > > In my understanding, the buffer id is the element that the avail
> > > > descriptor marks to identify when adding descriptors to the table. The
> > > > device returns the buffer id in the processed descriptor or the last
> > > > descriptor in a chain, and writes it to the descriptor that used idx
> > > > refers to (the first one in the chain). Then used idx increments.
> > > >
> > > > The Packed Virtqueue blog [1] is helpful, but some details in the
> > > > examples
> > > > are making me confused.
> > > >
> > > > Q1.
> > > > In the last step of the two-entries descriptor table example, it says
> > > > both buffers #0 and #1 are available for the device. I understand
> > > > descriptor[0] is available and descriptor[1] is not, but there is no ID 
> > > > #0
> > > > now. So did the device get buffer #0 by notification beforehand? If so,
> > > > does it mean buffer #0 will be lost when notifications are disabled?
> >
> > I guess you mean the table labeled "Figure: Full two-entries descriptor
> > table".
> >
> > Take into account that the descriptor table is not the state of all
> > the descriptors. That information must be maintained by the device and
> > the driver internally.
> >
> > The descriptor table is used as a circular buffer, where one part is
> > writable by the driver and the other part is writable by the device.
> > For the device to override the descriptor table entry where descriptor
> > id 0 used to be does not mean that the descriptor id 0 is used. It
> > just means that the device communicates to the driver that descriptor
> > 1 is used, and both sides need to keep the descriptor state
> > coherently.
> >
> > > I too have a similar question and understanding the relation between
> > > buffer
> > > ids in the used and available descriptors might give more insight into
> > > this. For available descriptors, the buffer id is used to associate
> > > descriptors with a particular buffer. I am still not very sure about ids
> > > in used descriptors.
> > >
> > > Regarding Q1, both buffers #0 and #1 are available. In the mentioned
> > > figure, both descriptor[0] and descriptor[1] are available. This figure
> > > follows the figure with the caption "Using first buffer out of order". So
> > > in the first figure the device reads buffer #1 and writes the used
> > > descriptor but it still has buffer #0 to read. That still belongs to the
> > > device while buffer #1 can now be handled by the driver once again. So in
> > > the next figure, the driver makes buffer #1 available again. The device
> > > can still read buffer #0 from the previous batch of available
> > > descriptors.
> > >
> > > Based on what I have understood, the driver can't touch the descriptor
> > > corresponding to buffer #0 until the device acknowledges it. I did find
> > > the
> > > figure a little confusing as well. I think once the meaning of buffer id
> > > is clear from the driver's and device's perspective, it'll be easier to
> > > understand the figure.
> >
> > I think you got it right. Please let me know if you have further questions.
>
> I would like to clarify one thing in the fi

Re: [RFC v2 1/5] virtio: Initialize sequence variables

2024-04-03 Thread Eugenio Perez Martin
On Thu, Mar 28, 2024 at 5:22 PM Jonah Palmer  wrote:
>
> Initialize sequence variables for VirtQueue and VirtQueueElement
> structures. A VirtQueue's sequence variables are initialized when a
> VirtQueue is being created or reset. A VirtQueueElement's sequence
> variable is initialized when a VirtQueueElement is being initialized.
> These variables will be used to support the VIRTIO_F_IN_ORDER feature.
>
> A VirtQueue's used_seq_idx represents the next expected index in a
> sequence of VirtQueueElements to be processed (put on the used ring).
> The next VirtQueueElement added to the used ring must match this
> sequence number before additional elements can be safely added to the
> used ring. It's also particularly useful for helping find the number of
> new elements added to the used ring.
>
> A VirtQueue's current_seq_idx represents the current sequence index.
> This value is essentially a counter where the value is assigned to a new
> VirtQueueElement and then incremented. Given its uint16_t type, this
> sequence number can be between 0 and 65,535.
>
> A VirtQueueElement's seq_idx represents the sequence number assigned to
> the VirtQueueElement when it was created. This value must match with the
> VirtQueue's used_seq_idx before the element can be put on the used ring
> by the device.
>
> Signed-off-by: Jonah Palmer 
> ---
>  hw/virtio/virtio.c | 18 ++
>  include/hw/virtio/virtio.h |  1 +
>  2 files changed, 19 insertions(+)
>
> diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
> index fb6b4ccd83..069d96df99 100644
> --- a/hw/virtio/virtio.c
> +++ b/hw/virtio/virtio.c
> @@ -132,6 +132,10 @@ struct VirtQueue
>  uint16_t used_idx;
>  bool used_wrap_counter;
>
> +/* In-Order sequence indices */
> +uint16_t used_seq_idx;
> +uint16_t current_seq_idx;
> +

I'm having a hard time understanding the difference between these and
last_avail_idx and used_idx. It seems to me if we replace them
everything will work? What am I missing?

>  /* Last used index value we have signalled on */
>  uint16_t signalled_used;
>
> @@ -1621,6 +1625,11 @@ static void *virtqueue_split_pop(VirtQueue *vq, size_t 
> sz)
>  elem->in_sg[i] = iov[out_num + i];
>  }
>
> +/* Assign sequence index for in-order processing */
> +if (virtio_vdev_has_feature(vdev, VIRTIO_F_IN_ORDER)) {
> +elem->seq_idx = vq->current_seq_idx++;
> +}
> +
>  vq->inuse++;
>
>  trace_virtqueue_pop(vq, elem, elem->in_num, elem->out_num);
> @@ -1760,6 +1769,11 @@ static void *virtqueue_packed_pop(VirtQueue *vq, 
> size_t sz)
>  vq->shadow_avail_idx = vq->last_avail_idx;
>  vq->shadow_avail_wrap_counter = vq->last_avail_wrap_counter;
>
> +/* Assign sequence index for in-order processing */
> +if (virtio_vdev_has_feature(vdev, VIRTIO_F_IN_ORDER)) {
> +elem->seq_idx = vq->current_seq_idx++;
> +}
> +
>  trace_virtqueue_pop(vq, elem, elem->in_num, elem->out_num);
>  done:
>  address_space_cache_destroy(&indirect_desc_cache);
> @@ -2087,6 +2101,8 @@ static void __virtio_queue_reset(VirtIODevice *vdev, 
> uint32_t i)
>  vdev->vq[i].notification = true;
>  vdev->vq[i].vring.num = vdev->vq[i].vring.num_default;
>  vdev->vq[i].inuse = 0;
> +vdev->vq[i].used_seq_idx = 0;
> +vdev->vq[i].current_seq_idx = 0;
>  virtio_virtqueue_reset_region_cache(&vdev->vq[i]);
>  }
>
> @@ -2334,6 +2350,8 @@ VirtQueue *virtio_add_queue(VirtIODevice *vdev, int 
> queue_size,
>  vdev->vq[i].vring.align = VIRTIO_PCI_VRING_ALIGN;
>  vdev->vq[i].handle_output = handle_output;
>  vdev->vq[i].used_elems = g_new0(VirtQueueElement, queue_size);
> +vdev->vq[i].used_seq_idx = 0;
> +vdev->vq[i].current_seq_idx = 0;
>
>  return &vdev->vq[i];
>  }
> diff --git a/include/hw/virtio/virtio.h b/include/hw/virtio/virtio.h
> index b3c74a1bca..910b2a3427 100644
> --- a/include/hw/virtio/virtio.h
> +++ b/include/hw/virtio/virtio.h
> @@ -75,6 +75,7 @@ typedef struct VirtQueueElement
>  hwaddr *out_addr;
>  struct iovec *in_sg;
>  struct iovec *out_sg;
> +uint16_t seq_idx;
>  } VirtQueueElement;
>
>  #define VIRTIO_QUEUE_MAX 1024
> --
> 2.39.3
>




Re: [PATCH v2 6/7] vdpa: move iova_tree allocation to net_vhost_vdpa_init

2024-04-03 Thread Eugenio Perez Martin
On Wed, Apr 3, 2024 at 8:53 AM Si-Wei Liu  wrote:
>
>
>
> On 4/2/2024 5:01 AM, Eugenio Perez Martin wrote:
> > On Tue, Apr 2, 2024 at 8:19 AM Si-Wei Liu  wrote:
> >>
> >>
> >> On 2/14/2024 11:11 AM, Eugenio Perez Martin wrote:
> >>> On Wed, Feb 14, 2024 at 7:29 PM Si-Wei Liu  wrote:
> >>>> Hi Michael,
> >>>>
> >>>> On 2/13/2024 2:22 AM, Michael S. Tsirkin wrote:
> >>>>> On Mon, Feb 05, 2024 at 05:10:36PM -0800, Si-Wei Liu wrote:
> >>>>>> Hi Eugenio,
> >>>>>>
> >>>>>> I thought this new code looks good to me and the original issue I saw 
> >>>>>> with
> >>>>>> x-svq=on should be gone. However, after rebase my tree on top of this,
> >>>>>> there's a new failure I found around setting up guest mappings at early
> >>>>>> boot, please see attached the specific QEMU config and corresponding 
> >>>>>> event
> >>>>>> traces. Haven't checked into the detail yet, thinking you would need 
> >>>>>> to be
> >>>>>> aware of ahead.
> >>>>>>
> >>>>>> Regards,
> >>>>>> -Siwei
> >>>>> Eugenio were you able to reproduce? Siwei did you have time to
> >>>>> look into this?
> >>>> Didn't get a chance to look into the detail yet in the past week, but
> >>>> thought it may have something to do with the (internals of) iova tree
> >>>> range allocation and the lookup routine. It started to fall apart at the
> >>>> first vhost_vdpa_dma_unmap call showing up in the trace events, where it
> >>>> should've gotten IOVA=0x201000,  but an incorrect IOVA address
> >>>> 0x1000 ended up being returned from the iova tree lookup routine.
> >>>>
> >>>> HVA                                     GPA                  IOVA
> >>>> -------------------------------------------------------------------
> >>>> Map
> >>>> [0x7f7903e0, 0x7f7983e0)[0x0, 0x8000) [0x1000, 
> >>>> 0x8000)
> >>>> [0x7f7983e0, 0x7f9903e0)[0x1, 0x208000)
> >>>> [0x80001000, 0x201000)
> >>>> [0x7f7903ea, 0x7f7903ec)[0xfeda, 0xfedc)
> >>>> [0x201000, 0x221000)
> >>>>
> >>>> Unmap
> >>>> [0x7f7903ea, 0x7f7903ec)[0xfeda, 0xfedc) [0x1000,
> >>>> 0x2) ???
> >>>>shouldn't it be [0x201000,
> >>>> 0x221000) ???
> >>>>
> >> It looks like the SVQ iova tree lookup routine vhost_iova_tree_find_iova(),
> >> which is called from vhost_vdpa_listener_region_del(), can't properly
> >> deal with overlapped region. Specifically, q35's mch_realize() has the
> >> following:
> >>
> >> 579 memory_region_init_alias(&mch->open_high_smram, OBJECT(mch),
> >> "smram-open-high",
> >> 580  mch->ram_memory,
> >> MCH_HOST_BRIDGE_SMRAM_C_BASE,
> >> 581  MCH_HOST_BRIDGE_SMRAM_C_SIZE);
> >> 582 memory_region_add_subregion_overlap(mch->system_memory, 0xfeda,
> >> 583 &mch->open_high_smram, 1);
> >> 584 memory_region_set_enabled(&mch->open_high_smram, false);
> >>
> >> #0  0x564c30bf6980 in iova_tree_find_address_iterator
> >> (key=0x564c331cf8e0, value=0x564c331cf8e0, data=0x7fffb6d749b0) at
> >> ../util/iova-tree.c:96
> >> #1  0x7f5f66479654 in g_tree_foreach () at /lib64/libglib-2.0.so.0
> >> #2  0x564c30bf6b53 in iova_tree_find_iova (tree=<optimized out>,
> >> map=map@entry=0x7fffb6d74a00) at ../util/iova-tree.c:114
> >> #3  0x564c309da0a9 in vhost_iova_tree_find_iova (tree=<optimized
> >> out>, map=map@entry=0x7fffb6d74a00) at ../hw/virtio/vhost-iova-tree.c:70
> >> #4  0x564c3085e49d in vhost_vdpa_listener_region_del
> >> (listener=0x564c331024c8, section=0x7fffb6d74aa0) at
> >> ../hw/virtio/vhost-vdpa.c:444
> >> #5  0x564c309f4931 in address_space_update_topology_pass
> >> (as=as@entry=0x564c31ab1840 ,
> >> old_view=old_view@entry=0x564c33364cc0,
> >> new_view=new_view@entry=0x564c333640f0, adding=adding@entry=false) at
> >> ../system/memory.c:977

[PATCH] Makefile: preserve --jobserver-auth argument when calling ninja

2024-04-02 Thread Martin Hundebøll
QEMU wraps its call to ninja in a Makefile. Since ninja, as opposed to
make, utilizes all CPU cores by default, the QEMU Makefile translates
the absence of a `-jN` argument into `-j1`. This breaks jobserver
functionality, so update the -jN mangling to take the --jobserver-auth
argument into consideration too.

Signed-off-by: Martin Hundebøll 
---
 Makefile | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/Makefile b/Makefile
index 8f36990335..183756018f 100644
--- a/Makefile
+++ b/Makefile
@@ -142,7 +142,7 @@ MAKE.k = $(findstring k,$(firstword $(filter-out 
--%,$(MAKEFLAGS
 MAKE.q = $(findstring q,$(firstword $(filter-out --%,$(MAKEFLAGS
 MAKE.nq = $(if $(word 2, $(MAKE.n) $(MAKE.q)),nq)
 NINJAFLAGS = $(if $V,-v) $(if $(MAKE.n), -n) $(if $(MAKE.k), -k0) \
-$(filter-out -j, $(lastword -j1 $(filter -l% -j%, $(MAKEFLAGS \
+$(or $(filter -l% -j%, $(MAKEFLAGS)), $(if $(filter 
--jobserver-auth=%, $(MAKEFLAGS)),, -j1)) \
 -d keepdepfile
 ninja-cmd-goals = $(or $(MAKECMDGOALS), all)
 ninja-cmd-goals += $(foreach g, $(MAKECMDGOALS), $(.ninja-goals.$g))
-- 
2.44.0




Re: [PATCH v2 6/7] vdpa: move iova_tree allocation to net_vhost_vdpa_init

2024-04-02 Thread Eugenio Perez Martin
On Tue, Apr 2, 2024 at 8:19 AM Si-Wei Liu  wrote:
>
>
>
> On 2/14/2024 11:11 AM, Eugenio Perez Martin wrote:
> > On Wed, Feb 14, 2024 at 7:29 PM Si-Wei Liu  wrote:
> >> Hi Michael,
> >>
> >> On 2/13/2024 2:22 AM, Michael S. Tsirkin wrote:
> >>> On Mon, Feb 05, 2024 at 05:10:36PM -0800, Si-Wei Liu wrote:
> >>>> Hi Eugenio,
> >>>>
> >>>> I thought this new code looks good to me and the original issue I saw 
> >>>> with
> >>>> x-svq=on should be gone. However, after rebase my tree on top of this,
> >>>> there's a new failure I found around setting up guest mappings at early
> >>>> boot, please see attached the specific QEMU config and corresponding 
> >>>> event
> >>>> traces. Haven't checked into the detail yet, thinking you would need to 
> >>>> be
> >>>> aware of ahead.
> >>>>
> >>>> Regards,
> >>>> -Siwei
> >>> Eugenio were you able to reproduce? Siwei did you have time to
> >>> look into this?
> >> Didn't get a chance to look into the detail yet in the past week, but
> >> thought it may have something to do with the (internals of) iova tree
> >> range allocation and the lookup routine. It started to fall apart at the
> >> first vhost_vdpa_dma_unmap call showing up in the trace events, where it
> >> should've gotten IOVA=0x201000,  but an incorrect IOVA address
> >> 0x1000 ended up being returned from the iova tree lookup routine.
> >>
> >> HVA                                     GPA                  IOVA
> >> -------------------------------------------------------------------
> >> Map
> >> [0x7f7903e0, 0x7f7983e0)[0x0, 0x8000) [0x1000, 0x8000)
> >> [0x7f7983e0, 0x7f9903e0)[0x1, 0x208000)
> >> [0x80001000, 0x201000)
> >> [0x7f7903ea, 0x7f7903ec)[0xfeda, 0xfedc)
> >> [0x201000, 0x221000)
> >>
> >> Unmap
> >> [0x7f7903ea, 0x7f7903ec)[0xfeda, 0xfedc) [0x1000,
> >> 0x2) ???
> >>   shouldn't it be [0x201000,
> >> 0x221000) ???
> >>
> It looks like the SVQ iova tree lookup routine vhost_iova_tree_find_iova(),
> which is called from vhost_vdpa_listener_region_del(), can't properly
> deal with overlapped region. Specifically, q35's mch_realize() has the
> following:
>
> 579 memory_region_init_alias(&mch->open_high_smram, OBJECT(mch),
> "smram-open-high",
> 580  mch->ram_memory,
> MCH_HOST_BRIDGE_SMRAM_C_BASE,
> 581  MCH_HOST_BRIDGE_SMRAM_C_SIZE);
> 582 memory_region_add_subregion_overlap(mch->system_memory, 0xfeda,
> 583 &mch->open_high_smram, 1);
> 584 memory_region_set_enabled(&mch->open_high_smram, false);
>
> #0  0x564c30bf6980 in iova_tree_find_address_iterator
> (key=0x564c331cf8e0, value=0x564c331cf8e0, data=0x7fffb6d749b0) at
> ../util/iova-tree.c:96
> #1  0x7f5f66479654 in g_tree_foreach () at /lib64/libglib-2.0.so.0
> #2  0x564c30bf6b53 in iova_tree_find_iova (tree=<optimized out>,
> map=map@entry=0x7fffb6d74a00) at ../util/iova-tree.c:114
> #3  0x564c309da0a9 in vhost_iova_tree_find_iova (tree=<optimized
> out>, map=map@entry=0x7fffb6d74a00) at ../hw/virtio/vhost-iova-tree.c:70
> #4  0x564c3085e49d in vhost_vdpa_listener_region_del
> (listener=0x564c331024c8, section=0x7fffb6d74aa0) at
> ../hw/virtio/vhost-vdpa.c:444
> #5  0x564c309f4931 in address_space_update_topology_pass
> (as=as@entry=0x564c31ab1840 ,
> old_view=old_view@entry=0x564c33364cc0,
> new_view=new_view@entry=0x564c333640f0, adding=adding@entry=false) at
> ../system/memory.c:977
> #6  0x564c309f4dcd in address_space_set_flatview (as=0x564c31ab1840
> ) at ../system/memory.c:1079
> #7  0x564c309f86d0 in memory_region_transaction_commit () at
> ../system/memory.c:1132
> #8  0x564c309f86d0 in memory_region_transaction_commit () at
> ../system/memory.c:1117
> #9  0x564c307cce64 in mch_realize (d=,
> errp=) at ../hw/pci-host/q35.c:584
>
> However, it looks like iova_tree_find_address_iterator() only checks if
> the translated address (HVA) falls in to the range when trying to locate
> the desired IOVA, causing the first DMAMap that happens to overlap in
> the translated address (HVA) space to be returned prematurely:
>
>   89 static gboolean iova_tree_find_address_iterator(gpointer key,
> gpointer

Re: Intention to work on GSoC project

2024-04-02 Thread Eugenio Perez Martin
On Tue, Apr 2, 2024 at 6:58 AM Sahil  wrote:
>
> Hi,
>
> On Monday, April 1, 2024 11:53:11 PM IST daleyoung4...@gmail.com wrote:
> > Hi,
> >
> > On Monday, March 25, 2024 21:20:32 CST Sahil wrote:
> > > Q1.
> > > Section 2.7.4 of the virtio spec [3] states that in an available
> > > descriptor, the "Element Length" stores the length of the buffer element.
> > > In the next few lines, it also states that the "Element Length" is
> > > reserved for used descriptors and is ignored by drivers. This sounds a
> > > little contradictory given that drivers write available descriptors in the
> > > descriptor ring.
> > When VIRTQ_DESC_F_WRITE is set, the device will use "Element Length" to
> > specify the length it writes. When VIRTQ_DESC_F_WRITE is not set, which
> > means the buffer is read-only for the device, "Element Length" will not be
> > changed by the device, so drivers just ignore it.
>
>
> Thank you for the clarification. I think I misunderstood what I had read
> in the virtio spec. What I have understood now is that "Element Length"
> has different meanings for available and used descriptors.
>
> Correct me if I am wrong - for available descriptors, it represents the
> length of the buffer. For used descriptors, it represents the length of
> the buffer that is written to by the device if it's write-only, otherwise it
> has no meaning and hence can be ignored by drivers.
>

Both answers are correct.

> > > Q2.
> > > In the Red Hat article, just below the first listing ("Memory layout of a
> > > packed virtqueue descriptor"), there's the following line referring to the
> > > buffer id in
> > >
> > > "virtq_desc":
> > > > This time, the id field is not an index for the device to look for the
> > > > buffer: it is an opaque value for it, only has meaning for the driver.
> > >
> > > But the device returns the buffer id when it writes the used descriptor to
> > > the descriptor ring. The "only has meaning for the driver" part has got me
> > > a little confused. Which buffer id is this that the device returns? Is it
> > > related to the buffer id in the available descriptor?
> >
> > In my understanding, the buffer id is the element that the avail descriptor
> > marks to identify when adding descriptors to the table. The device returns
> > the buffer id in the processed descriptor or the last descriptor in a chain,
> > and writes it to the descriptor that used idx refers to (the first one in
> > the chain). Then used idx increments.
> >
> > The Packed Virtqueue blog [1] is helpful, but some details in the examples
> > are making me confused.
> >
> > Q1.
> > In the last step of the two-entries descriptor table example, it says both
> > buffers #0 and #1 are available for the device. I understand descriptor[0]
> > is available and descriptor[1] is not, but there is no ID #0 now. So did
> > the device get buffer #0 by notification beforehand? If so, does it mean
> > buffer #0 will be lost when notifications are disabled?
> >
>

I guess you mean the table labeled "Figure: Full two-entries descriptor table".

Take into account that the descriptor table is not the state of all
the descriptors. That information must be maintained by the device and
the driver internally.

The descriptor table is used as a circular buffer, where one part is
writable by the driver and the other part is writable by the device.
For the device to override the descriptor table entry where descriptor
id 0 used to be does not mean that the descriptor id 0 is used. It
just means that the device communicates to the driver that descriptor
1 is used, and both sides need to keep the descriptor state
coherently.

> I too have a similar question, and understanding the relation between buffer
> ids in the used and available descriptors might give more insight into this.
> For available descriptors, the buffer id is used to associate descriptors
> with a particular buffer. I am still not very sure about ids in used
> descriptors.
>
> Regarding Q1, both buffers #0 and #1 are available. In the mentioned figure,
> both descriptor[0] and descriptor[1] are available. This figure follows the
> figure with the caption "Using first buffer out of order". So in the first
> figure the device reads buffer #1 and writes the used descriptor, but it
> still has buffer #0 to read. That still belongs to the device, while buffer
> #1 can now be handled by the driver once again. So in the next figure, the
> driver makes buffer #1 available again. The device can still read buffer #0
> from the previous batch of available descriptors.
>
> Based on what I have understood, the driver can't touch the descriptor
> corresponding to buffer #0 until the device acknowledges it. I did find the
> figure a little confusing as well. I think once the meaning of the buffer id
> is clear from the driver's and device's perspective, it'll be easier to
> understand the figure.
>

I think you got it right. Please let me know if you have further questions.

> I am also not very sure about what hap

Re: [RFC 0/8] virtio,vhost: Add VIRTIO_F_IN_ORDER support

2024-03-26 Thread Eugenio Perez Martin
On Tue, Mar 26, 2024 at 5:49 PM Jonah Palmer  wrote:
>
>
>
> On 3/25/24 4:33 PM, Eugenio Perez Martin wrote:
> > On Mon, Mar 25, 2024 at 5:52 PM Jonah Palmer  
> > wrote:
> >>
> >>
> >>
> >> On 3/22/24 7:18 AM, Eugenio Perez Martin wrote:
> >>> On Thu, Mar 21, 2024 at 4:57 PM Jonah Palmer  
> >>> wrote:
> >>>>
> >>>> The goal of these patches is to add support to a variety of virtio and
> >>>> vhost devices for the VIRTIO_F_IN_ORDER transport feature. This feature
> >>>> indicates that all buffers are used by the device in the same order in
> >>>> which they were made available by the driver.
> >>>>
> >>>> These patches attempt to implement a generalized, non-device-specific
> >>>> solution to support this feature.
> >>>>
> >>>> The core feature behind this solution is a buffer mechanism in the form
> >>>> of GLib's GHashTable. The decision behind using a hash table was to
> >>>> leverage their ability for quick lookup, insertion, and removal
> >>>> operations. Given that our keys are simply numbers of an ordered
> >>>> sequence, a hash table seemed like the best choice for a buffer
> >>>> mechanism.
> >>>>
> >>>> -
> >>>>
> >>>> The strategy behind this implementation is as follows:
> >>>>
> >>>> We know that buffers that are popped from the available ring and enqueued
> >>>> for further processing will always be done in the same order in which they
> >>>> were made available by the driver. Given this, we can note their order
> >>>> by assigning the resulting VirtQueueElement a key. This key is a number
> >>>> in a sequence that represents the order in which they were popped from
> >>>> the available ring, relative to the other VirtQueueElements.
> >>>>
> >>>> For example, given 3 "elements" that were popped from the available
> >>>> ring, we assign a key value to them which represents their order (elem0
> >>>> is popped first, then elem1, then lastly elem2):
> >>>>
> >>>>elem2   --  elem1   --  elem0   ---> Enqueue for processing
> >>>>   (key: 2)(key: 1)(key: 0)
> >>>>
> >>>> Then these elements are enqueued for further processing by the host.
> >>>>
> >>>> While most devices will return these completed elements in the same
> >>>> order in which they were enqueued, some devices may not (e.g.
> >>>> virtio-blk). To guarantee that these elements are put on the used ring
> >>>> in the same order in which they were enqueued, we can use a buffering
> >>>> mechanism that keeps track of the next expected sequence number of an
> >>>> element.
> >>>>
> >>>> In other words, if the completed element does not have a key value that
> >>>> matches the next expected sequence number, then we know this element is
> >>>> not in-order and we must stash it away in a hash table until an order
> >>>> can be made. The element's key value is used as the key for placing it
> >>>> in the hash table.
> >>>>
> >>>> If the completed element has a key value that matches the next expected
> >>>> sequence number, then we know this element is in-order and we can push
> >>>> it on the used ring. Then we increment the next expected sequence number
> >>>> and check if the hash table contains an element at this key location.
> >>>>
> >>>> If so, we retrieve this element, push it to the used ring, delete the
> >>>> key-value pair from the hash table, increment the next expected sequence
> >>>> number, and check the hash table again for an element at this new key
> >>>> location. This process is repeated until we're unable to find an element
> >>>> in the hash table to continue the order.
> >>>>
> >>>> So, for example, say the 3 elements we enqueued were completed in the
> >>>> following order: elem1, elem2, elem0. The next expected sequence number
> >>>> is 0:
> >>>>
> >>>>   exp-seq-num = 0:
> >>>>
>

Re: [RFC 0/8] virtio,vhost: Add VIRTIO_F_IN_ORDER support

2024-03-25 Thread Eugenio Perez Martin
On Mon, Mar 25, 2024 at 5:52 PM Jonah Palmer  wrote:
>
>
>
> On 3/22/24 7:18 AM, Eugenio Perez Martin wrote:
> > On Thu, Mar 21, 2024 at 4:57 PM Jonah Palmer  
> > wrote:
> >>
> >> The goal of these patches is to add support to a variety of virtio and
> >> vhost devices for the VIRTIO_F_IN_ORDER transport feature. This feature
> >> indicates that all buffers are used by the device in the same order in
> >> which they were made available by the driver.
> >>
> >> These patches attempt to implement a generalized, non-device-specific
> >> solution to support this feature.
> >>
> >> The core feature behind this solution is a buffer mechanism in the form
> >> of GLib's GHashTable. The decision behind using a hash table was to
> >> leverage their ability for quick lookup, insertion, and removal
> >> operations. Given that our keys are simply numbers of an ordered
> >> sequence, a hash table seemed like the best choice for a buffer
> >> mechanism.
> >>
> >> -
> >>
> >> The strategy behind this implementation is as follows:
> >>
> >> We know that buffers that are popped from the available ring and enqueued
> >> for further processing will always be done in the same order in which they
> >> were made available by the driver. Given this, we can note their order
> >> by assigning the resulting VirtQueueElement a key. This key is a number
> >> in a sequence that represents the order in which they were popped from
> >> the available ring, relative to the other VirtQueueElements.
> >>
> >> For example, given 3 "elements" that were popped from the available
> >> ring, we assign a key value to them which represents their order (elem0
> >> is popped first, then elem1, then lastly elem2):
> >>
> >>   elem2   --  elem1   --  elem0   ---> Enqueue for processing
> >>  (key: 2)(key: 1)(key: 0)
> >>
> >> Then these elements are enqueued for further processing by the host.
> >>
> >> While most devices will return these completed elements in the same
> >> order in which they were enqueued, some devices may not (e.g.
> >> virtio-blk). To guarantee that these elements are put on the used ring
> >> in the same order in which they were enqueued, we can use a buffering
> >> mechanism that keeps track of the next expected sequence number of an
> >> element.
> >>
> >> In other words, if the completed element does not have a key value that
> >> matches the next expected sequence number, then we know this element is
> >> not in-order and we must stash it away in a hash table until an order
> >> can be made. The element's key value is used as the key for placing it
> >> in the hash table.
> >>
> >> If the completed element has a key value that matches the next expected
> >> sequence number, then we know this element is in-order and we can push
> >> it on the used ring. Then we increment the next expected sequence number
> >> and check if the hash table contains an element at this key location.
> >>
> >> If so, we retrieve this element, push it to the used ring, delete the
> >> key-value pair from the hash table, increment the next expected sequence
> >> number, and check the hash table again for an element at this new key
> >> location. This process is repeated until we're unable to find an element
> >> in the hash table to continue the order.
> >>
> >> So, for example, say the 3 elements we enqueued were completed in the
> >> following order: elem1, elem2, elem0. The next expected sequence number
> >> is 0:
> >>
> >>  exp-seq-num = 0:
> >>
> >>   elem1   --> elem1.key == exp-seq-num ? --> No, stash it
> >>  (key: 1) |
> >>   |
> >>   v
> >> 
> >> |key: 1 - elem1|
> >> 
> >>  -
> >>  exp-seq-num = 0:
> >>
> >>   elem2   --> elem2.key == exp-seq-num ? --> No, stash it
> >>  (key: 2) |
> >>  

Re: [RFC 4/8] virtio: Implement in-order handling for virtio devices

2024-03-25 Thread Eugenio Perez Martin
On Mon, Mar 25, 2024 at 6:35 PM Jonah Palmer  wrote:
>
>
>
> On 3/22/24 6:46 AM, Eugenio Perez Martin wrote:
> > On Thu, Mar 21, 2024 at 4:57 PM Jonah Palmer  
> > wrote:
> >>
> >> Implements in-order handling for most virtio devices using the
> >> VIRTIO_F_IN_ORDER transport feature, specifically those that call
> >> virtqueue_push to push their used elements onto the used ring.
> >>
> >> The logic behind this implementation is as follows:
> >>
> >> 1.) virtqueue_pop always enqueues VirtQueueElements in-order.
> >>
> >> virtqueue_pop always retrieves one or more buffer descriptors in-order
> >> from the available ring and converts them into a VirtQueueElement. This
> >> means that the order in which VirtQueueElements are enqueued is
> >> in-order by default.
> >>
> >> By virtue of this, as VirtQueueElements are created, we can assign a sequential
> >> key value to them. This preserves the order of buffers that have been
> >> made available to the device by the driver.
> >>
> >> As VirtQueueElements are assigned a key value, the current sequence
> >> number is incremented.
> >>
> >> 2.) Requests can be completed out-of-order.
> >>
> >> While most devices complete requests in the same order that they were
> >> enqueued by default, some devices don't (e.g. virtio-blk). The goal of
> >> this out-of-order handling is to reduce the impact on devices that
> >> process elements in-order by default while also guaranteeing compliance
> >> with the VIRTIO_F_IN_ORDER feature.
> >>
> >> Below is the logic behind handling completed requests (which may or may
> >> not be in-order).
> >>
> >> 3.) Does the incoming used VirtQueueElement preserve the correct order?
> >>
> >> In other words, is the sequence number (key) assigned to the
> >> VirtQueueElement the expected number that would preserve the original
> >> order?
> >>
> >> 3a.)
> >> If it does... immediately push the used element onto the used ring.
> >> Then increment the next expected sequence number and check to see if
> >> any previous out-of-order VirtQueueElements stored on the hash table
> >> has a key that matches this next expected sequence number.
> >>
> >> For each VirtQueueElement found on the hash table with a matching key:
> >> push the element on the used ring, remove the key-value pair from the
> >> hash table, and then increment the next expected sequence number. Repeat
> >> this process until we're unable to find an element with a matching key.
> >>
> >> Note that if the device uses batching (e.g. virtio-net), then we skip
> >> the virtqueue_flush call and let the device call it itself.
> >>
> >> 3b.)
> >> If it does not... stash the VirtQueueElement, along with relevant data,
> >> as an InOrderVQElement on the hash table. The key used is the order_key
> >> that was assigned when the VirtQueueElement was created.
> >>
> >> Signed-off-by: Jonah Palmer 
> >> ---
> >>   hw/virtio/virtio.c | 70 --
> >>   include/hw/virtio/virtio.h |  8 +
> >>   2 files changed, 76 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
> >> index 40124545d6..40e4377f1e 100644
> >> --- a/hw/virtio/virtio.c
> >> +++ b/hw/virtio/virtio.c
> >> @@ -992,12 +992,56 @@ void virtqueue_flush(VirtQueue *vq, unsigned int 
> >> count)
> >>   }
> >>   }
> >>
> >> +void virtqueue_order_element(VirtQueue *vq, const VirtQueueElement *elem,
> >> + unsigned int len, unsigned int idx,
> >> + unsigned int count)
> >> +{
> >> +InOrderVQElement *in_order_elem;
> >> +
> >> +if (elem->order_key == vq->current_order_idx) {
> >> +/* Element is in-order, push to used ring */
> >> +virtqueue_fill(vq, elem, len, idx);
> >> +
> >> +/* Batching? Don't flush */
> >> +if (count) {
> >> +virtqueue_flush(vq, count);
> >
> > The "count" parameter is the number of heads used, but here you're
> > only using one head (elem). Same with the other virtqueue_flush in the
> > function.
> >
>
> True. This acts more as a flag than an actual count since, unless we're

Re: [RFC 1/8] virtio: Define InOrderVQElement

2024-03-25 Thread Eugenio Perez Martin
On Mon, Mar 25, 2024 at 6:08 PM Jonah Palmer  wrote:
>
>
>
> On 3/22/24 5:45 AM, Eugenio Perez Martin wrote:
> > On Thu, Mar 21, 2024 at 4:57 PM Jonah Palmer  
> > wrote:
> >>
> >> Define the InOrderVQElement structure for the VIRTIO_F_IN_ORDER
> >> transport feature implementation.
> >>
> >> The InOrderVQElement structure is used to encapsulate out-of-order
> >> VirtQueueElement data that was processed by the host. This data
> >> includes:
> >>   - The processed VirtQueueElement (elem)
> >>   - Length of data (len)
> >>   - VirtQueueElement array index (idx)
> >>   - Number of processed VirtQueueElements (count)
> >>
> >> InOrderVQElements will be stored in a buffering mechanism until an
> >> order can be achieved.
> >>
> >> Signed-off-by: Jonah Palmer 
> >> ---
> >>   include/hw/virtio/virtio.h | 7 +++
> >>   1 file changed, 7 insertions(+)
> >>
> >> diff --git a/include/hw/virtio/virtio.h b/include/hw/virtio/virtio.h
> >> index b3c74a1bca..c8aa435a5e 100644
> >> --- a/include/hw/virtio/virtio.h
> >> +++ b/include/hw/virtio/virtio.h
> >> @@ -77,6 +77,13 @@ typedef struct VirtQueueElement
> >>   struct iovec *out_sg;
> >>   } VirtQueueElement;
> >>
> >> +typedef struct InOrderVQElement {
> >> +const VirtQueueElement *elem;
> >
> > Some subsystems allocate space for extra elements after
> > VirtQueueElement, like VirtIOBlockReq. You can request virtqueue_pop
> > to allocate this extra space by its second argument. Would it work for
> > this?
> >
>
> I don't see why not. Although this may not be necessary due to me
> missing a key aspect mentioned in your comment below.
>
> >> +unsigned int len;
> >> +unsigned int idx;
> >> +unsigned int count;
> >
> > Now I don't get why these fields cannot be obtained from elem->(len,
> > index, ndescs) ?
> >
>
> Interesting. I didn't realize that these values are equivalent to a
> VirtQueueElement's len, index, and ndescs fields.
>
> Is this always true? Else I would've expected, for example,
> virtqueue_push to not need the 'unsigned int len' parameter if this
> information is already included via the VirtQueueElement being passed in.
>

The code uses "len" to store the written length values of each used
descriptor between virtqueue_fill and virtqueue_flush. But not all
devices use these separately, only the ones that batch: virtio-net
and SVQ.

A smarter / less simple implementation of virtqueue_push could
certainly avoid storing elem->len. But the performance gain is
probably tiny, and the code complexity grows.

> >> +} InOrderVQElement;
> >> +
> >>   #define VIRTIO_QUEUE_MAX 1024
> >>
> >>   #define VIRTIO_NO_VECTOR 0x
> >> --
> >> 2.39.3
> >>
> >
>




Re: [RFC 0/8] virtio,vhost: Add VIRTIO_F_IN_ORDER support

2024-03-22 Thread Eugenio Perez Martin
On Thu, Mar 21, 2024 at 4:57 PM Jonah Palmer  wrote:
>
> The goal of these patches is to add support to a variety of virtio and
> vhost devices for the VIRTIO_F_IN_ORDER transport feature. This feature
> indicates that all buffers are used by the device in the same order in
> which they were made available by the driver.
>
> These patches attempt to implement a generalized, non-device-specific
> solution to support this feature.
>
> The core feature behind this solution is a buffer mechanism in the form
> of GLib's GHashTable. The decision behind using a hash table was to
> leverage their ability for quick lookup, insertion, and removal
> operations. Given that our keys are simply numbers of an ordered
> sequence, a hash table seemed like the best choice for a buffer
> mechanism.
>
> -
>
> The strategy behind this implementation is as follows:
>
> We know that buffers that are popped from the available ring and enqueued
> for further processing will always be done in the same order in which they
> were made available by the driver. Given this, we can note their order
> by assigning the resulting VirtQueueElement a key. This key is a number
> in a sequence that represents the order in which they were popped from
> the available ring, relative to the other VirtQueueElements.
>
> For example, given 3 "elements" that were popped from the available
> ring, we assign a key value to them which represents their order (elem0
> is popped first, then elem1, then lastly elem2):
>
>  elem2   --  elem1   --  elem0   ---> Enqueue for processing
> (key: 2)(key: 1)(key: 0)
>
> Then these elements are enqueued for further processing by the host.
>
> While most devices will return these completed elements in the same
> order in which they were enqueued, some devices may not (e.g.
> virtio-blk). To guarantee that these elements are put on the used ring
> in the same order in which they were enqueued, we can use a buffering
> mechanism that keeps track of the next expected sequence number of an
> element.
>
> In other words, if the completed element does not have a key value that
> matches the next expected sequence number, then we know this element is
> not in-order and we must stash it away in a hash table until an order
> can be made. The element's key value is used as the key for placing it
> in the hash table.
>
> If the completed element has a key value that matches the next expected
> sequence number, then we know this element is in-order and we can push
> it on the used ring. Then we increment the next expected sequence number
> and check if the hash table contains an element at this key location.
>
> If so, we retrieve this element, push it to the used ring, delete the
> key-value pair from the hash table, increment the next expected sequence
> number, and check the hash table again for an element at this new key
> location. This process is repeated until we're unable to find an element
> in the hash table to continue the order.
>
> So, for example, say the 3 elements we enqueued were completed in the
> following order: elem1, elem2, elem0. The next expected sequence number
> is 0:
>
> exp-seq-num = 0:
>
>  elem1   --> elem1.key == exp-seq-num ? --> No, stash it
> (key: 1) |
>  |
>  v
>
>|key: 1 - elem1|
>
> -
> exp-seq-num = 0:
>
>  elem2   --> elem2.key == exp-seq-num ? --> No, stash it
> (key: 2) |
>  |
>  v
>
>|key: 1 - elem1|
>|--|
>|key: 2 - elem2|
>
> -
> exp-seq-num = 0:
>
>  elem0   --> elem0.key == exp-seq-num ? --> Yes, push to used ring
> (key: 0)
>
> exp-seq-num = 1:
>
> lookup(table, exp-seq-num) != NULL ? --> Yes, push to used ring,
>  remove elem from table
>  |
>  v
>
>|key: 2 - elem2|
>
>
> exp-seq-num = 2:
>
> lookup(table, exp-seq-num) != NULL ? --> Yes, push to used ring,
>   

Re: [RFC 8/8] virtio: Add VIRTIO_F_IN_ORDER property definition

2024-03-22 Thread Eugenio Perez Martin
On Thu, Mar 21, 2024 at 4:57 PM Jonah Palmer  wrote:
>
> Extend the virtio device property definitions to include the
> VIRTIO_F_IN_ORDER feature.
>
> The default state of this feature is disabled, allowing it to be
> explicitly enabled where it's supported.
>

Acked-by: Eugenio Pérez 

Thanks!

> Signed-off-by: Jonah Palmer 
> ---
>  include/hw/virtio/virtio.h | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/include/hw/virtio/virtio.h b/include/hw/virtio/virtio.h
> index eeeda397a9..ffd78830a3 100644
> --- a/include/hw/virtio/virtio.h
> +++ b/include/hw/virtio/virtio.h
> @@ -400,7 +400,9 @@ typedef struct VirtIORNGConf VirtIORNGConf;
>  DEFINE_PROP_BIT64("packed", _state, _field, \
>VIRTIO_F_RING_PACKED, false), \
>  DEFINE_PROP_BIT64("queue_reset", _state, _field, \
> -  VIRTIO_F_RING_RESET, true)
> +  VIRTIO_F_RING_RESET, true), \
> +DEFINE_PROP_BIT64("in_order", _state, _field, \
> +  VIRTIO_F_IN_ORDER, false)
>
>  hwaddr virtio_queue_get_desc_addr(VirtIODevice *vdev, int n);
>  bool virtio_queue_enabled_legacy(VirtIODevice *vdev, int n);
> --
> 2.39.3
>




Re: [RFC 7/8] vhost/vhost-user: Add VIRTIO_F_IN_ORDER to vhost feature bits

2024-03-22 Thread Eugenio Perez Martin
On Thu, Mar 21, 2024 at 4:57 PM Jonah Palmer  wrote:
>
> Add support for the VIRTIO_F_IN_ORDER feature across a variety of vhost
> devices.
>
> The inclusion of VIRTIO_F_IN_ORDER in the feature bits arrays for these
> devices ensures that the backend is capable of offering and providing
> support for this feature, and that it can be disabled if the backend
> does not support it.
>

Acked-by: Eugenio Pérez 

Thanks!

> Signed-off-by: Jonah Palmer 
> ---
>  hw/block/vhost-user-blk.c| 1 +
>  hw/net/vhost_net.c   | 2 ++
>  hw/scsi/vhost-scsi.c | 1 +
>  hw/scsi/vhost-user-scsi.c| 1 +
>  hw/virtio/vhost-user-fs.c| 1 +
>  hw/virtio/vhost-user-vsock.c | 1 +
>  net/vhost-vdpa.c | 1 +
>  7 files changed, 8 insertions(+)
>
> diff --git a/hw/block/vhost-user-blk.c b/hw/block/vhost-user-blk.c
> index 6a856ad51a..d176ed857e 100644
> --- a/hw/block/vhost-user-blk.c
> +++ b/hw/block/vhost-user-blk.c
> @@ -51,6 +51,7 @@ static const int user_feature_bits[] = {
>  VIRTIO_F_RING_PACKED,
>  VIRTIO_F_IOMMU_PLATFORM,
>  VIRTIO_F_RING_RESET,
> +VIRTIO_F_IN_ORDER,
>  VHOST_INVALID_FEATURE_BIT
>  };
>
> diff --git a/hw/net/vhost_net.c b/hw/net/vhost_net.c
> index e8e1661646..33d1d4b9d3 100644
> --- a/hw/net/vhost_net.c
> +++ b/hw/net/vhost_net.c
> @@ -48,6 +48,7 @@ static const int kernel_feature_bits[] = {
>  VIRTIO_F_IOMMU_PLATFORM,
>  VIRTIO_F_RING_PACKED,
>  VIRTIO_F_RING_RESET,
> +VIRTIO_F_IN_ORDER,
>  VIRTIO_NET_F_HASH_REPORT,
>  VHOST_INVALID_FEATURE_BIT
>  };
> @@ -76,6 +77,7 @@ static const int user_feature_bits[] = {
>  VIRTIO_F_IOMMU_PLATFORM,
>  VIRTIO_F_RING_PACKED,
>  VIRTIO_F_RING_RESET,
> +VIRTIO_F_IN_ORDER,
>  VIRTIO_NET_F_RSS,
>  VIRTIO_NET_F_HASH_REPORT,
>  VIRTIO_NET_F_GUEST_USO4,
> diff --git a/hw/scsi/vhost-scsi.c b/hw/scsi/vhost-scsi.c
> index ae26bc19a4..40e7630191 100644
> --- a/hw/scsi/vhost-scsi.c
> +++ b/hw/scsi/vhost-scsi.c
> @@ -38,6 +38,7 @@ static const int kernel_feature_bits[] = {
>  VIRTIO_RING_F_EVENT_IDX,
>  VIRTIO_SCSI_F_HOTPLUG,
>  VIRTIO_F_RING_RESET,
> +VIRTIO_F_IN_ORDER,
>  VHOST_INVALID_FEATURE_BIT
>  };
>
> diff --git a/hw/scsi/vhost-user-scsi.c b/hw/scsi/vhost-user-scsi.c
> index a63b1f4948..1d59951ab7 100644
> --- a/hw/scsi/vhost-user-scsi.c
> +++ b/hw/scsi/vhost-user-scsi.c
> @@ -36,6 +36,7 @@ static const int user_feature_bits[] = {
>  VIRTIO_RING_F_EVENT_IDX,
>  VIRTIO_SCSI_F_HOTPLUG,
>  VIRTIO_F_RING_RESET,
> +VIRTIO_F_IN_ORDER,
>  VHOST_INVALID_FEATURE_BIT
>  };
>
> diff --git a/hw/virtio/vhost-user-fs.c b/hw/virtio/vhost-user-fs.c
> index cca2cd41be..9243dbb128 100644
> --- a/hw/virtio/vhost-user-fs.c
> +++ b/hw/virtio/vhost-user-fs.c
> @@ -33,6 +33,7 @@ static const int user_feature_bits[] = {
>  VIRTIO_F_RING_PACKED,
>  VIRTIO_F_IOMMU_PLATFORM,
>  VIRTIO_F_RING_RESET,
> +VIRTIO_F_IN_ORDER,
>
>  VHOST_INVALID_FEATURE_BIT
>  };
> diff --git a/hw/virtio/vhost-user-vsock.c b/hw/virtio/vhost-user-vsock.c
> index 9431b9792c..cc7e4e47b4 100644
> --- a/hw/virtio/vhost-user-vsock.c
> +++ b/hw/virtio/vhost-user-vsock.c
> @@ -21,6 +21,7 @@ static const int user_feature_bits[] = {
>  VIRTIO_RING_F_INDIRECT_DESC,
>  VIRTIO_RING_F_EVENT_IDX,
>  VIRTIO_F_NOTIFY_ON_EMPTY,
> +VIRTIO_F_IN_ORDER,
>  VHOST_INVALID_FEATURE_BIT
>  };
>
> diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c
> index 85e73dd6a7..ed3185acfa 100644
> --- a/net/vhost-vdpa.c
> +++ b/net/vhost-vdpa.c
> @@ -62,6 +62,7 @@ const int vdpa_feature_bits[] = {
>  VIRTIO_F_RING_PACKED,
>  VIRTIO_F_RING_RESET,
>  VIRTIO_F_VERSION_1,
> +VIRTIO_F_IN_ORDER,
>  VIRTIO_NET_F_CSUM,
>  VIRTIO_NET_F_CTRL_GUEST_OFFLOADS,
>  VIRTIO_NET_F_CTRL_MAC_ADDR,
> --
> 2.39.3
>




Re: [RFC 4/8] virtio: Implement in-order handling for virtio devices

2024-03-22 Thread Eugenio Perez Martin
On Thu, Mar 21, 2024 at 4:57 PM Jonah Palmer  wrote:
>
> Implements in-order handling for most virtio devices using the
> VIRTIO_F_IN_ORDER transport feature, specifically those that call
> virtqueue_push to push their used elements onto the used ring.
>
> The logic behind this implementation is as follows:
>
> 1.) virtqueue_pop always enqueues VirtQueueElements in-order.
>
> virtqueue_pop always retrieves one or more buffer descriptors in-order
> from the available ring and converts them into a VirtQueueElement. This
> means that the order in which VirtQueueElements are enqueued is
> in-order by default.
>
> By virtue of this, as VirtQueueElements are created, we can assign a sequential
> key value to them. This preserves the order of buffers that have been
> made available to the device by the driver.
>
> As VirtQueueElements are assigned a key value, the current sequence
> number is incremented.
>
> 2.) Requests can be completed out-of-order.
>
> While most devices complete requests in the same order that they were
> enqueued by default, some devices don't (e.g. virtio-blk). The goal of
> this out-of-order handling is to reduce the impact on devices that
> process elements in-order by default while also guaranteeing compliance
> with the VIRTIO_F_IN_ORDER feature.
>
> Below is the logic behind handling completed requests (which may or may
> not be in-order).
>
> 3.) Does the incoming used VirtQueueElement preserve the correct order?
>
> In other words, is the sequence number (key) assigned to the
> VirtQueueElement the expected number that would preserve the original
> order?
>
> 3a.)
> If it does... immediately push the used element onto the used ring.
> Then increment the next expected sequence number and check to see if
> any previous out-of-order VirtQueueElements stored on the hash table
> has a key that matches this next expected sequence number.
>
> For each VirtQueueElement found on the hash table with a matching key:
> push the element on the used ring, remove the key-value pair from the
> hash table, and then increment the next expected sequence number. Repeat
> this process until we're unable to find an element with a matching key.
>
> Note that if the device uses batching (e.g. virtio-net), then we skip
> the virtqueue_flush call and let the device call it itself.
>
> 3b.)
> If it does not... stash the VirtQueueElement, along with relevant data,
> as an InOrderVQElement on the hash table. The key used is the order_key
> that was assigned when the VirtQueueElement was created.
>
> Signed-off-by: Jonah Palmer 
> ---
>  hw/virtio/virtio.c | 70 --
>  include/hw/virtio/virtio.h |  8 +
>  2 files changed, 76 insertions(+), 2 deletions(-)
>
> diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
> index 40124545d6..40e4377f1e 100644
> --- a/hw/virtio/virtio.c
> +++ b/hw/virtio/virtio.c
> @@ -992,12 +992,56 @@ void virtqueue_flush(VirtQueue *vq, unsigned int count)
>  }
>  }
>
> +void virtqueue_order_element(VirtQueue *vq, const VirtQueueElement *elem,
> + unsigned int len, unsigned int idx,
> + unsigned int count)
> +{
> +InOrderVQElement *in_order_elem;
> +
> +if (elem->order_key == vq->current_order_idx) {
> +/* Element is in-order, push to used ring */
> +virtqueue_fill(vq, elem, len, idx);
> +
> +/* Batching? Don't flush */
> +if (count) {
> +virtqueue_flush(vq, count);

The "count" parameter is the number of heads used, but here you're
only using one head (elem). Same with the other virtqueue_flush in the
function.

Also, this function sometimes replaces virtqueue_fill and other times
replaces virtqueue_fill + virtqueue_flush (both examples in patch
6/8). I have the impression the series would be simpler if
virtqueue_order_element is a static function just handling the
virtio_vdev_has_feature(vq->vdev, VIRTIO_F_IN_ORDER) path of
virtqueue_fill, so the caller does not need to know if the in_order
feature is on or off.

> +}
> +
> +/* Increment next expected order, search for more in-order elements 
> */
> +while ((in_order_elem = g_hash_table_lookup(vq->in_order_ht,
> +GUINT_TO_POINTER(++vq->current_order_idx))) != NULL) 
> {
> +/* Found in-order element, push to used ring */
> +virtqueue_fill(vq, in_order_elem->elem, in_order_elem->len,
> +   in_order_elem->idx);
> +
> +/* Batching? Don't flush */
> +if (count) {
> +virtqueue_flush(vq, in_order_elem->count);
> +}
> +
> +/* Remove key-value pair from hash table */
> +g_hash_table_remove(vq->in_order_ht,
> +GUINT_TO_POINTER(vq->current_order_idx));
> +}
> +} else {
> +/* Element is out-of-order, stash in hash table */
> +in_order_elem = virtqueue_alloc_in_order_element(ele

Re: [RFC 1/8] virtio: Define InOrderVQElement

2024-03-22 Thread Eugenio Perez Martin
On Thu, Mar 21, 2024 at 4:57 PM Jonah Palmer  wrote:
>
> Define the InOrderVQElement structure for the VIRTIO_F_IN_ORDER
> transport feature implementation.
>
> The InOrderVQElement structure is used to encapsulate out-of-order
> VirtQueueElement data that was processed by the host. This data
> includes:
>  - The processed VirtQueueElement (elem)
>  - Length of data (len)
>  - VirtQueueElement array index (idx)
>  - Number of processed VirtQueueElements (count)
>
> InOrderVQElements will be stored in a buffering mechanism until an
> order can be achieved.
>
> Signed-off-by: Jonah Palmer 
> ---
>  include/hw/virtio/virtio.h | 7 +++
>  1 file changed, 7 insertions(+)
>
> diff --git a/include/hw/virtio/virtio.h b/include/hw/virtio/virtio.h
> index b3c74a1bca..c8aa435a5e 100644
> --- a/include/hw/virtio/virtio.h
> +++ b/include/hw/virtio/virtio.h
> @@ -77,6 +77,13 @@ typedef struct VirtQueueElement
>  struct iovec *out_sg;
>  } VirtQueueElement;
>
> +typedef struct InOrderVQElement {
> +const VirtQueueElement *elem;

Some subsystems allocate space for extra elements after
VirtQueueElement, like VirtIOBlockReq. You can request virtqueue_pop
to allocate this extra space by its second argument. Would it work for
this?

> +unsigned int len;
> +unsigned int idx;
> +unsigned int count;

Now I don't get why these fields cannot be obtained from elem->(len,
index, ndescs) ?

> +} InOrderVQElement;
> +
>  #define VIRTIO_QUEUE_MAX 1024
>
>  #define VIRTIO_NO_VECTOR 0x
> --
> 2.39.3
>




Re: Intention to work on GSoC project

2024-03-20 Thread Eugenio Perez Martin
On Mon, Mar 18, 2024 at 8:47 PM Sahil  wrote:
>
> Hi,
>
> I was reading the "Virtqueues and virtio ring: How the data travels"
> article [1]. There are a few things that I have not understood in the
> "avail rings" section.
>
> Q1.
> Step 2 in the "Process to make a buffer available" diagram depicts
> how the virtio driver writes the descriptor index in the avail ring.
> In the example, the descriptor index #0 is written in the first entry.
> But in figure 2, the number 0 is in the 4th position in the avail ring.
> Is the avail ring queue an array of "struct virtq_avail" which maintains
> metadata such as the number of descriptor indexes in the header?
>

struct virtq_avail has two members: uint16_t idx and ring[]. To be in
the first position of the avail ring means to be in ring[0] there.

Idx and ring[] are just headers in the figure, not actual positions.
Same as Avail. Now that you mention it, maybe there is a better way to
represent that, yes.

Let me know if I didn't explain it well.

> Also, in the second position, the number changes from 0 (figure 1) to
> 1 (figure 2). I haven't understood what idx, 0 (later 1) and ring[] represent
> in the figures. Does this number represent the number of descriptors
> that are currently in the avail ring?
>

It is the position in ring[] where the device needs to stop looking
for descriptors. It starts at 0, and when the device sees 1 it means
ring[0] has a descriptor to process.

Now you need to apply a "modulo virtqueue size" to that index. So if
the virtqueue is 256, avail_idx 257 means the last valid descriptor is
at 0. This happens naturally when the driver keeps adding descriptors
and wraps the queue.

The authoritative source of this is the VirtQueues section of the
virtio standard [1], feel free to check it in case it clarifies
something better.
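The index arithmetic described above can be sketched like this (the
helper name is mine, not from the spec or the QEMU code):

```c
#include <stdint.h>

/* Map a free-running 16-bit avail idx to a position in ring[].
 * avail_idx itself just keeps incrementing and wraps at 16 bits;
 * both sides reduce it modulo the queue size to find the slot. */
static uint16_t ring_slot(uint16_t avail_idx, uint16_t queue_size)
{
    return avail_idx % queue_size;
}
```

With a 256-entry queue, avail_idx 257 means the device must look up to
ring[257 % 256] == ring[1], i.e. the last valid descriptor sits at
ring[0]; and because the counter is a uint16_t, 65535 + 1 wraps back
to 0 with no special handling.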

> Q2.
>
> There's this paragraph in the article right below the above mentioned
> diagram:
>
> > The avail ring must be able to hold the same number of descriptors
> > as the descriptor area, and the descriptor area must have a size power
> > of two, so idx wraps naturally at some point. For example, if the ring
> > size is 256 entries, idx 1 references the same descriptor as idx 257, 513...
> > And it will wrap at a 16 bit boundary. This way, neither side needs to
> > worry about processing an invalid idx: They are all valid.
>
> I haven't really understood this. I have understood that idx is calculated
> as idx mod queue_length. But I haven't understood the "16 bit boundary"
> part.
>

avail_idx is a uint16_t, so ((uint16_t)-1) + 1 == 0.

> I am also not very clear on how a queue length that is not a power of 2
> might cause trouble. Could you please expand on this?
>

That's a limitation in the standard, but I'm not sure where it comes
from beyond being computationally easier to calculate ring position
with a mask than with a remainder of a random non-power-of-two number.
Packed virtqueue removes that limitation.
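The computational argument is easy to see in code: with a power-of-two
size the modulo reduces to a single bit mask (illustrative sketch only,
names are mine):

```c
#include <stdint.h>

/* Power-of-two queue size: ring position is one AND operation. */
static uint16_t slot_mask(uint16_t idx, uint16_t queue_size_pow2)
{
    return idx & (queue_size_pow2 - 1);
}

/* Arbitrary queue size would need a generic remainder instead. */
static uint16_t slot_mod(uint16_t idx, uint16_t queue_size)
{
    return idx % queue_size;
}
```

For a power-of-two size both give the same answer, but the mask avoids
a division in the datapath.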

> Q3.
> I have started going through the source code in 
> "drivers/virtio/virtio_ring.c".
> I have understood that the virtio driver runs in the guest's kernel. Does that
> mean the drivers in "drivers/virtio/*" are enabled when linux is being run in
> a guest VM?
>

For PCI devices, as long as it detects a device with vendor == Red
Hat, Inc. (0x1AF4) and device ID 0x1000 through 0x107F inclusive, yes.
You can also load and unload manually with modprobe as other drivers.
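As an illustration of that matching rule (a sketch of the check the
virtio-pci driver effectively performs, not the actual kernel ID table):

```c
#include <stdint.h>
#include <stdbool.h>

#define VIRTIO_PCI_VENDOR 0x1af4  /* Red Hat, Inc. */

/* Transitional and modern virtio PCI devices use device IDs
 * 0x1000 through 0x107f under vendor 0x1af4. */
static bool is_virtio_pci(uint16_t vendor, uint16_t device)
{
    return vendor == VIRTIO_PCI_VENDOR &&
           device >= 0x1000 && device <= 0x107f;
}
```

For example, 0x1af4:0x1041 (a modern virtio-net device) matches, while
a device from another vendor with the same device ID does not.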

Let me know if you have more doubts. Thanks!

[1] https://docs.oasis-open.org/virtio/virtio/v1.3/virtio-v1.3.html


> Thanks,
> Sahil
>
> [1] https://www.redhat.com/en/blog/virtqueues-and-virtio-ring-how-data-travels
>
>
>
>




Re: Intention to work on GSoC project

2024-03-20 Thread Eugenio Perez Martin
On Sat, Mar 16, 2024 at 9:27 PM Sahil  wrote:
>
> Hi,
>
> Thank you for your reply.
>
> On Friday, March 15, 2024 4:57:39 PM IST Eugenio Perez Martin wrote:
> > [...]
> > > Some sections in the above docs were difficult to grasp. For the time
> > > being, I have focused on those parts that I thought were relevant
> > > to the project.
> >
> > Please feel free to ask any questions, maybe we can improve the doc :).
>
> I understood the introductory sections of the documentation such as the
> "About QEMU" section and the first half of the "system emulation". Sections
> and subsections that went into greater detail were a little overwhelming
> such as the "QEMU virtio-net standby" subsection [1] or the "migration
> features" [2] subsection. But the red hat blogs and deep-dive articles helped
> cover a lot of ground conceptually.
>
> I feel once I start getting my hands dirty, I'll be able to absorb these 
> concepts
> much better.
>
> I did have two questions that I would like to ask.
>
> Q1.
> Regarding the "Deep dive into Virtio-networking and vhost-net" article [3],
> the "Introduction" subsection of the "Vhost protocol" section mentions that
> sending the available buffer notification involves a vCPU interrupt (4th 
> bullet
> point).

Now I realize we used a very misleading term there :). Without
ioeventfd, when the guest writes to the PCI notification area the
guest vCPU is totally paused there, and the control is handed to
host's KVM first and QEMU after it. The same physical CPU of the
machine needs to switch context because of that.

It is an interruption of the execution and a context switch. Maybe
"paused" is a better term.

> But in figure 2, the arrow for the "available buffer notification" indicates
> a PCI interrupt. Initially I thought they were two different interrupts but I 
> am
> a little confused about this now.
>

They are different, but that part of the blog just shows the direction
of who interrupts / notifies whom :).

> Q2.
> In the "Virtio-net failover operation" section of the "Virtio-net failover: An
> introduction" article [4], there are five bullet points under the first 
> figure.
> The second point states that the guest kernel needs the ability to switch
> between the VFIO device and the vfio-net device. I was wondering if
> "vfio-net" is a typo and if it should be "virtio-net" instead.
>

Good catch :). CCing Laurent, the author of the blog, in case he can
modify the text.

> > [...]
> > There is a post before the first in the series:
> > https://www.redhat.com/en/blog/virtio-devices-and-drivers-overview-headjack-
> > and-phone
>
> Got it. I didn't know this was the first in the series. I have now covered 
> this as
> well, so I can move on to "Virtqueues and virtio ring: How the data travels" 
> [3] :)
>
> > > 1. Virtqueues and virtio ring: How the data travels [8]
> > > 2. Packed virtqueue: How to reduce overhead with virtio [9]
> > > 3. Virtio live migration technical deep dive [10]
> > > 4. Hands on vDPA: what do you do when you ain't got the hardware v2 (Part
> > > 1) [11]
> > I think it's a good plan!
> >
> > If you feel like you're reading a lot of theory and want to get your
> > hands dirty already, you can also start messing with the code with the
> > blogs you already read. Or, maybe, after reading the Packed virtqueue
> > one, your call.
> >
> > In a very brute-forced description, you can start trying to copy all
> > the *packed* stuff of kernel's drivers/virtio/virtio_ring.c into
> > vhost_shadow_virtqueue.c.
>
> I would love to start with some hands-on tasks. I'll take a look at
> the kernel's "drivers/virtio/virtio_ring.c". I think I should also start
> going through the "vhost_shadow_virtqueue.c" [4] source code.
>
> > There is a lot more in the task, and I can get into more detail
> > if you want either here or in a meeting.
>
> Thank you. Either means of communication works for me although
> the latter will require some coordination.
>
> > If you prefer to continue with the theory it is ok too.
>
> A good balance of theory and practice would be nice at this stage.
> It'll keep my brains from getting too muddled up.
>
> Thanks,
> Sahil
>
> [1] https://www.qemu.org/docs/master/system/virtio-net-failover.html
> [2] https://www.qemu.org/docs/master/devel/migration/features.html
> [3] https://www.redhat.com/en/blog/deep-dive-virtio-networking-and-vhost-net
> [4] https://www.redhat.com/en/blog/virtio-net-failover-introduction
> [5] https://www.redhat.com/en/blog/virtqueues-and-virtio-ring-how-data-travels
>
>




Re: [PATCH for-9.0 v3] vdpa-dev: Fix initialisation order to restore VDUSE compatibility

2024-03-19 Thread Eugenio Perez Martin
On Tue, Mar 19, 2024 at 11:00 AM Kevin Wolf  wrote:
>
> Am 18.03.2024 um 20:27 hat Eugenio Perez Martin geschrieben:
> > On Mon, Mar 18, 2024 at 10:02 AM Michael S. Tsirkin  wrote:
> > >
> > > On Mon, Mar 18, 2024 at 12:31:26PM +0800, Jason Wang wrote:
> > > > On Fri, Mar 15, 2024 at 11:59 PM Kevin Wolf  wrote:
> > > > >
> > > > > VDUSE requires that virtqueues are first enabled before the DRIVER_OK
> > > > > status flag is set; with the current API of the kernel module, it is
> > > > > impossible to enable the opposite order in our block export code 
> > > > > because
> > > > > userspace is not notified when a virtqueue is enabled.
> > > > >
> > > > > This requirement also matches the normal initialisation order as done 
> > > > > by
> > > > > the generic vhost code in QEMU. However, commit 6c482547 accidentally
> > > > > changed the order for vdpa-dev and broke access to VDUSE devices with
> > > > > this.
> > > > >
> > > > > This changes vdpa-dev to use the normal order again and use the 
> > > > > standard
> > > > > vhost callback .vhost_set_vring_enable for this. VDUSE devices can be
> > > > > used with vdpa-dev again after this fix.
> > > > >
> > > > > vhost_net intentionally avoided enabling the vrings for vdpa and does
> > > > > this manually later while it does enable them for other vhost 
> > > > > backends.
> > > > > Reflect this in the vhost_net code and return early for vdpa, so that
> > > > > the behaviour doesn't change for this device.
> > > > >
> > > > > Cc: qemu-sta...@nongnu.org
> > > > > Fixes: 6c4825476a4351530bcac17abab72295b75ffe98
> > > > > Signed-off-by: Kevin Wolf 
> > > > > ---
> > > > > v2:
> > > > > - Actually make use of the @enable parameter
> > > > > - Change vhost_net to preserve the current behaviour
> > > > >
> > > > > v3:
> > > > > - Updated trace point [Stefano]
> > > > > - Fixed typo in comment [Stefano]
> > > > >
> > > > >  hw/net/vhost_net.c | 10 ++
> > > > >  hw/virtio/vdpa-dev.c   |  5 +
> > > > >  hw/virtio/vhost-vdpa.c | 29 ++---
> > > > >  hw/virtio/vhost.c  |  8 +++-
> > > > >  hw/virtio/trace-events |  2 +-
> > > > >  5 files changed, 45 insertions(+), 9 deletions(-)
> > > > >
> > > > > diff --git a/hw/net/vhost_net.c b/hw/net/vhost_net.c
> > > > > index e8e1661646..fd1a93701a 100644
> > > > > --- a/hw/net/vhost_net.c
> > > > > +++ b/hw/net/vhost_net.c
> > > > > @@ -541,6 +541,16 @@ int vhost_set_vring_enable(NetClientState *nc, 
> > > > > int enable)
> > > > >  VHostNetState *net = get_vhost_net(nc);
> > > > >  const VhostOps *vhost_ops = net->dev.vhost_ops;
> > > > >
> > > > > +/*
> > > > > + * vhost-vdpa network devices need to enable dataplane 
> > > > > virtqueues after
> > > > > + * DRIVER_OK, so they can recover device state before starting 
> > > > > dataplane.
> > > > > + * Because of that, we don't enable virtqueues here and leave it 
> > > > > to
> > > > > + * net/vhost-vdpa.c.
> > > > > + */
> > > > > +if (nc->info->type == NET_CLIENT_DRIVER_VHOST_VDPA) {
> > > > > +return 0;
> > > > > +}
> > > >
> > > > I think we need some inputs from Eugenio, this is only needed for
> > > > shadow virtqueue during live migration but not other cases.
> > > >
> > > > Thanks
> > >
> > >
> > > Yes I think we had a backend flag for this, right? Eugenio can you
> > > comment please?
> > >
> >
> > We have the VHOST_BACKEND_F_ENABLE_AFTER_DRIVER_OK backend flag,
> > right. If the backend does not offer it, it is better to enable all
> > the queues here and add a migration blocker in net/vhost-vdpa.c.
> >
> > So the check should be:
> > nc->info->type == VHOST_VDPA && (backend_features &
> > VHOST_BACKEND_F_ENABLE_AFTER_DRIVER_OK).
> >
> > I can manage to add the migration blocker on top of this patch.
>
> Note that my patch preserves the current behaviour for vhost_net. The
> callback wasn't implemented for vdpa so far, so we never called anything
> even if the flag wasn't set. This patch adds an implementation for the
> callback, so we have to skip it here to have everything in vhost_net
> work as before - which is what the condition as written does.
>
> If we add a check for the flag now (I don't know if that's correct or
> not), that would be a second, unrelated change of behaviour in the same
> patch. So if it's necessary, that's a preexisting problem and I'd argue
> it doesn't belong in this patch, but should be done separately.
>

Right, that's a very good point. I'll add proper checking on top of
your patch when it is merged.

Reviewed-by: Eugenio Pérez 

Thanks!




Re: [PATCH for-9.0 v3] vdpa-dev: Fix initialisation order to restore VDUSE compatibility

2024-03-18 Thread Eugenio Perez Martin
On Mon, Mar 18, 2024 at 10:02 AM Michael S. Tsirkin  wrote:
>
> On Mon, Mar 18, 2024 at 12:31:26PM +0800, Jason Wang wrote:
> > On Fri, Mar 15, 2024 at 11:59 PM Kevin Wolf  wrote:
> > >
> > > VDUSE requires that virtqueues are first enabled before the DRIVER_OK
> > > status flag is set; with the current API of the kernel module, it is
> > > impossible to enable the opposite order in our block export code because
> > > userspace is not notified when a virtqueue is enabled.
> > >
> > > This requirement also matches the normal initialisation order as done by
> > > the generic vhost code in QEMU. However, commit 6c482547 accidentally
> > > changed the order for vdpa-dev and broke access to VDUSE devices with
> > > this.
> > >
> > > This changes vdpa-dev to use the normal order again and use the standard
> > > vhost callback .vhost_set_vring_enable for this. VDUSE devices can be
> > > used with vdpa-dev again after this fix.
> > >
> > > vhost_net intentionally avoided enabling the vrings for vdpa and does
> > > this manually later while it does enable them for other vhost backends.
> > > Reflect this in the vhost_net code and return early for vdpa, so that
> > > the behaviour doesn't change for this device.
> > >
> > > Cc: qemu-sta...@nongnu.org
> > > Fixes: 6c4825476a4351530bcac17abab72295b75ffe98
> > > Signed-off-by: Kevin Wolf 
> > > ---
> > > v2:
> > > - Actually make use of the @enable parameter
> > > - Change vhost_net to preserve the current behaviour
> > >
> > > v3:
> > > - Updated trace point [Stefano]
> > > - Fixed typo in comment [Stefano]
> > >
> > >  hw/net/vhost_net.c | 10 ++
> > >  hw/virtio/vdpa-dev.c   |  5 +
> > >  hw/virtio/vhost-vdpa.c | 29 ++---
> > >  hw/virtio/vhost.c  |  8 +++-
> > >  hw/virtio/trace-events |  2 +-
> > >  5 files changed, 45 insertions(+), 9 deletions(-)
> > >
> > > diff --git a/hw/net/vhost_net.c b/hw/net/vhost_net.c
> > > index e8e1661646..fd1a93701a 100644
> > > --- a/hw/net/vhost_net.c
> > > +++ b/hw/net/vhost_net.c
> > > @@ -541,6 +541,16 @@ int vhost_set_vring_enable(NetClientState *nc, int 
> > > enable)
> > >  VHostNetState *net = get_vhost_net(nc);
> > >  const VhostOps *vhost_ops = net->dev.vhost_ops;
> > >
> > > +/*
> > > + * vhost-vdpa network devices need to enable dataplane virtqueues 
> > > after
> > > + * DRIVER_OK, so they can recover device state before starting 
> > > dataplane.
> > > + * Because of that, we don't enable virtqueues here and leave it to
> > > + * net/vhost-vdpa.c.
> > > + */
> > > +if (nc->info->type == NET_CLIENT_DRIVER_VHOST_VDPA) {
> > > +return 0;
> > > +}
> >
> > I think we need some inputs from Eugenio, this is only needed for
> > shadow virtqueue during live migration but not other cases.
> >
> > Thanks
>
>
> Yes I think we had a backend flag for this, right? Eugenio can you
> comment please?
>

We have the VHOST_BACKEND_F_ENABLE_AFTER_DRIVER_OK backend flag,
right. If the backend does not offer it, it is better to enable all
the queues here and add a migration blocker in net/vhost-vdpa.c.

So the check should be:
nc->info->type == VHOST_VDPA && (backend_features &
VHOST_BACKEND_F_ENABLE_AFTER_DRIVER_OK).

I can manage to add the migration blocker on top of this patch.
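Spelled out as a sketch, the proposed condition would look roughly like
this (the feature-bit value and the abbreviated names here follow my
reading of linux/vhost_types.h and the QEMU net client enum, so treat
them as assumptions rather than the final patch):

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed bit number of the backend feature, per linux/vhost_types.h. */
#define VHOST_BACKEND_F_ENABLE_AFTER_DRIVER_OK 6

/* Only skip the early vring enable when the client is vhost-vdpa AND
 * the backend can enable rings after DRIVER_OK; otherwise the rings
 * should be enabled here and a migration blocker added instead. */
static bool skip_early_enable(int nc_type, int vdpa_type,
                              uint64_t backend_features)
{
    return nc_type == vdpa_type &&
           (backend_features &
            (1ULL << VHOST_BACKEND_F_ENABLE_AFTER_DRIVER_OK));
}
```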

Thanks!



