Re: [Qemu-devel] qemu-user-linux: how could I measure performance for aarch64 and arm?
пт, 11 янв. 2019 г. в 22:24, Matwey V. Kornilov : > > пт, 11 янв. 2019 г. в 12:52, Peter Maydell : > > > > On Thu, 10 Jan 2019 at 19:33, Matwey V. Kornilov > > wrote: > > > I am running the same application compiled for aarch64 and armv7l on > > > x86_64 platform using qemu-user-linux tools. > > > > > > I see dramatic performance difference (30 times) between emulated > > > architectures: aarch64 runs for ~4 minutes, armv7l runs for ~2 hours. > > > I do understand that CPU architecture emulation is inherently slow > > > thing, but my question is about the difference. > > > > > > How could I debug to understand what is the reason for such a big > > > difference? I've already tried to run stress-ng compiled for this two > > > architectures, but it leads to the same performance per second. > > > > > > I am running qemu 2.11, should I try other version? > > > > Yes, do try 3.1 -- we have done some overall TCG performance > > improvements. > > Indeed, qemu-arm from master runs for 4 minutes where 2.11 runs for 2 > hours for me. It is impressive improvement. I've managed to bisected the first good (fast) commit: commit 2a53535af471f4bee9d6cb5b363746b8d5ed21dd Author: Luke Shumaker Date: Thu Dec 28 13:08:13 2017 -0500 linux-user: init_guest_space: Try to make ARM space+commpage continuous Though I am not sure, how does it help. > > > > > For a big difference between target architectures like that, > > I would try starting by using some host performance tools on > > the two runs to see where all the time is being taken in > > the armv7l guest run -- is it all in translated guest code, > > or is there more time (proportionally) spent in particular > > parts of the QEMU C code? Does the armv7l version do > > many more or different syscalls (check with the QEMU -strace > > option) ? > > > > Also you should check performance on h/w 32 bit vs > > 64-bit Arm if you can, to confirm that it's not just > > that the guest application runs much slower there. > > (If you don't have the arm hardware you could at least > > check x86 32-bit vs 64-bit.) > > > > thanks > > -- PMM > > > > -- > With best regards, > Matwey V. Kornilov -- With best regards, Matwey V. Kornilov
Re: [Qemu-devel] [PATCH 1/2 v2] usb-mtp: Reallocate buffer in multiples of MTP_WRITE_BUF_SZ
Eric Blake writes: > On 1/11/19 2:20 AM, Bandan Das wrote: >> This is a "pre-patch" to breaking up the write buffer for >> MTP writes. Instead of allocating a mtp buffer equal to size >> sent by the initiator, we start with a small size and reallocate >> multiples (of that small size) as needed. >> >> Signed-off-by: Bandan Das >> --- >> hw/usb/dev-mtp.c | 26 -- >> 1 file changed, 12 insertions(+), 14 deletions(-) >> >> diff --git a/hw/usb/dev-mtp.c b/hw/usb/dev-mtp.c >> index b19b576278..a0d98c93ee 100644 >> --- a/hw/usb/dev-mtp.c >> +++ b/hw/usb/dev-mtp.c >> @@ -152,7 +152,6 @@ struct MTPData { >> bool first; >> /* Used for >4G file sizes */ >> bool pending; >> -uint64_t cached_length; >> int fd; >> }; >> >> @@ -244,6 +243,7 @@ typedef struct { >> >> #define MTP_MANUFACTURER "QEMU" >> #define MTP_PRODUCT "QEMU filesharing" >> +#define MTP_WRITE_BUF_SZ 512000 > > Why not a power of two? Perhaps use units.h and spell it (512 * KiB) ? Sure, I will change it in a later patch. Bandan
Re: [Qemu-devel] [PATCH 2/4] linuxboot_dma: move common functions in a new header
On 1/12/19 12:25 PM, Stefano Garzarella wrote: >>> Better to just pull in qemu/osdep.h >> >> Except that qemu/osdep.h should already have been pulled in by whatever >> .c file is including this header. We specifically document that .h files >> shouldn't need to include osdep.h (and in turn, anything that osdep.h >> already pulls in, like ). > > Since I can't include qemu/osdep.h because this header file is used by > the option roms (bare metal), do you think is better to include > stdint.h in optrom.h or in the .c files that use it? > As Stefan pointed out, I'm including qemu_fw_cfg.h in optrom.h and it > depends on stdint.h. In this case, what is the best approach? If this is one of the exception files that is used to build files outside of qemu, then its should probably be mentioned as an exception in scripts/clean-includes (okay, I see that pc-bios/ is already exempted) - and thus using osdep.h is not an option, but using is, because it is the exception to the rule, in relation to files used only for qemu. -- Eric Blake, Principal Software Engineer Red Hat, Inc. +1-919-301-3226 Virtualization: qemu.org | libvirt.org signature.asc Description: OpenPGP digital signature
Re: [Qemu-devel] [PATCH v3 0/5] kvm "virtio pmem" device
> > > > > > > > > > > > On Thu 10-01-19 12:26:17, Dave Chinner wrote: > > > > On Wed, Jan 09, 2019 at 08:17:31PM +0530, Pankaj Gupta wrote: > > > > > This patch series has implementation for "virtio pmem". > > > > > "virtio pmem" is fake persistent memory(nvdimm) in guest > > > > > which allows to bypass the guest page cache. This also > > > > > implements a VIRTIO based asynchronous flush mechanism. > > > > > > > > H. Sharing the host page cache direct into the guest VM. Sounds > > > > like a good idea, but. > > > > > > > > This means the guest VM can now run timing attacks to observe host > > > > side page cache residency, and depending on the implementation I'm > > > > guessing that the guest will be able to control host side page > > > > cache eviction, too (e.g. via discard or hole punch operations). > > > > > > > > Which means this functionality looks to me like a new vector for > > > > information leakage into and out of the guest VM via guest > > > > controlled host page cache manipulation. > > > > > > > > https://arxiv.org/pdf/1901.01161 > > > > > > > > I might be wrong, but if I'm not we're going to have to be very > > > > careful about how guest VMs can access and manipulate host side > > > > resources like the page cache. > > > > > > Right. Thinking about this I would be more concerned about the fact that > > > guest can effectively pin amount of host's page cache upto size of the > > > device/file passed to guest as PMEM, can't it Pankaj? Or is there some > > > QEMU > > > magic that avoids this? > > > > Yes, guest will pin these host page cache pages using 'get_user_pages' by > > elevating the page reference count. But these pages can be reclaimed by > > host > > at any time when there is memory pressure. > > Wait, how can the guest pin the host pages? I would expect this to > happen only when using vfio and device assignment. Otherwise, no the > host can't reclaim a pinned page, that's the whole point of a pin to > prevent the mm from reclaiming ownership. yes. You are right I just used the pin word but it does not actually pin pages permanently. I had gone through the discussion on existing problems with get_user_pages and DMA e.g [1] to understand Jan's POV. It does mention GUP pin pages so I also used the word 'pin'. But guest does not permanently pin these pages and these pages can be reclaimed by host. > > > KVM does not permanently pin pages. vfio does that but we are not using > > it here. > > Right, so I'm confused by your pin assertion above. Sorry! for the confusion. [1] https://lwn.net/Articles/753027/ Thanks, Pankaj
Re: [Qemu-devel] [PATCH v3 0/5] kvm "virtio pmem" device
On Sat, Jan 12, 2019 at 5:38 PM Pankaj Gupta wrote: > > > > > > > On Thu 10-01-19 12:26:17, Dave Chinner wrote: > > > On Wed, Jan 09, 2019 at 08:17:31PM +0530, Pankaj Gupta wrote: > > > > This patch series has implementation for "virtio pmem". > > > > "virtio pmem" is fake persistent memory(nvdimm) in guest > > > > which allows to bypass the guest page cache. This also > > > > implements a VIRTIO based asynchronous flush mechanism. > > > > > > H. Sharing the host page cache direct into the guest VM. Sounds > > > like a good idea, but. > > > > > > This means the guest VM can now run timing attacks to observe host > > > side page cache residency, and depending on the implementation I'm > > > guessing that the guest will be able to control host side page > > > cache eviction, too (e.g. via discard or hole punch operations). > > > > > > Which means this functionality looks to me like a new vector for > > > information leakage into and out of the guest VM via guest > > > controlled host page cache manipulation. > > > > > > https://arxiv.org/pdf/1901.01161 > > > > > > I might be wrong, but if I'm not we're going to have to be very > > > careful about how guest VMs can access and manipulate host side > > > resources like the page cache. > > > > Right. Thinking about this I would be more concerned about the fact that > > guest can effectively pin amount of host's page cache upto size of the > > device/file passed to guest as PMEM, can't it Pankaj? Or is there some QEMU > > magic that avoids this? > > Yes, guest will pin these host page cache pages using 'get_user_pages' by > elevating the page reference count. But these pages can be reclaimed by host > at any time when there is memory pressure. Wait, how can the guest pin the host pages? I would expect this to happen only when using vfio and device assignment. Otherwise, no the host can't reclaim a pinned page, that's the whole point of a pin to prevent the mm from reclaiming ownership. > KVM does not permanently pin pages. vfio does that but we are not using > it here. Right, so I'm confused by your pin assertion above.
Re: [Qemu-devel] [PATCH v3 0/5] kvm "virtio pmem" device
> > On Thu 10-01-19 12:26:17, Dave Chinner wrote: > > On Wed, Jan 09, 2019 at 08:17:31PM +0530, Pankaj Gupta wrote: > > > This patch series has implementation for "virtio pmem". > > > "virtio pmem" is fake persistent memory(nvdimm) in guest > > > which allows to bypass the guest page cache. This also > > > implements a VIRTIO based asynchronous flush mechanism. > > > > H. Sharing the host page cache direct into the guest VM. Sounds > > like a good idea, but. > > > > This means the guest VM can now run timing attacks to observe host > > side page cache residency, and depending on the implementation I'm > > guessing that the guest will be able to control host side page > > cache eviction, too (e.g. via discard or hole punch operations). > > > > Which means this functionality looks to me like a new vector for > > information leakage into and out of the guest VM via guest > > controlled host page cache manipulation. > > > > https://arxiv.org/pdf/1901.01161 > > > > I might be wrong, but if I'm not we're going to have to be very > > careful about how guest VMs can access and manipulate host side > > resources like the page cache. > > Right. Thinking about this I would be more concerned about the fact that > guest can effectively pin amount of host's page cache upto size of the > device/file passed to guest as PMEM, can't it Pankaj? Or is there some QEMU > magic that avoids this? Yes, guest will pin these host page cache pages using 'get_user_pages' by elevating the page reference count. But these pages can be reclaimed by host at any time when there is memory pressure. KVM does not permanently pin pages. vfio does that but we are not using it here. Could you please elaborate what you are thinking? Thanks, Pankaj
[Qemu-devel] [Bug 1811543] [NEW] virtio-scsi gives improper discard sysfs entries
Public bug reported: Apologies if this is just an inherent part of paravirtualization that should be expected. In my host, I have an LVM thin pool with chunk_size 128MB. Within it, I have a thin volume "tmp". In the host: # fdisk -l /dev/lvm/tmp Disk /dev/lvm/tmp: 256 MiB, 268435456 bytes, 524288 sectors Units: sectors of 1 * 512 = 512 bytes Sector size (logical/physical): 512 bytes / 4096 bytes I/O size (minimum/optimal): 262144 bytes / 134217728 bytes Disklabel type: gpt Disk identifier: BAE3154E-6E85-F642-8129-BAD7B58B2775 DeviceStartEnd Sectors Size Type /dev/lvm/tmp1 2048 524254 522207 255M Linux filesystem $ lsblk ... └─lvm-tmp 254:13 0 256M 0 lvm └─lvm-tmp1 254:14 0 255M 0 part $ cat /sys/dev/block/254:13/discard_alignment 0 $ cat /sys/dev/block/254:13/queue/discard_granularity 134217728 $ cat /sys/dev/block/254:13/queue/discard_max_bytes 17179869184 $ cat /sys/dev/block/254:13/queue/discard_max_hw_bytes 0 $ cat /sys/dev/block/254:13/queue/discard_zeroes_data 0 $ cat /sys/dev/block/254:14/discard_alignment 133169152 $ cat /sys/dev/block/254:14/queue/discard_granularity 134217728 $ cat /sys/dev/block/254:14/queue/discard_max_bytes 17179869184 $ cat /sys/dev/block/254:14/queue/discard_max_hw_bytes 0 $ cat /sys/dev/block/254:14/queue/discard_zeroes_data 0 If this is given to QEMU using virtio-scsi: -device virtio-scsi-pci,id=scsi1 \ -drive driver=raw,node-name=hdb,file=/dev/lvm/tmp,if=none,discard=unmap,id=hd2 \ -device scsi-hd,drive=hd2,bootindex=1 \ Then incorrect values are given: $ lsblk ... sdb 8:16 0 256M 0 disk └─sdb1 8:17 0 255M 0 part /mnt $ cat /sys/dev/block/8:16/discard_alignment 0 $ cat /sys/dev/block/8:16/queue/discard_granularity 4096 $ cat /sys/dev/block/8:16/queue/discard_max_bytes 1073741824 $ cat /sys/dev/block/8:16/queue/discard_max_hw_bytes 1073741824 $ cat /sys/dev/block/8:16/queue/discard_zeroes_data 0 $ cat /sys/dev/block/8:17/discard_alignment 133169152 And, there isn't even a /sys/dev/block/8:17/queue direcotry. ** Affects: qemu Importance: Undecided Status: New -- You received this bug notification because you are a member of qemu- devel-ml, which is subscribed to QEMU. https://bugs.launchpad.net/bugs/1811543 Title: virtio-scsi gives improper discard sysfs entries Status in QEMU: New Bug description: Apologies if this is just an inherent part of paravirtualization that should be expected. In my host, I have an LVM thin pool with chunk_size 128MB. Within it, I have a thin volume "tmp". In the host: # fdisk -l /dev/lvm/tmp Disk /dev/lvm/tmp: 256 MiB, 268435456 bytes, 524288 sectors Units: sectors of 1 * 512 = 512 bytes Sector size (logical/physical): 512 bytes / 4096 bytes I/O size (minimum/optimal): 262144 bytes / 134217728 bytes Disklabel type: gpt Disk identifier: BAE3154E-6E85-F642-8129-BAD7B58B2775 DeviceStartEnd Sectors Size Type /dev/lvm/tmp1 2048 524254 522207 255M Linux filesystem $ lsblk ... └─lvm-tmp 254:13 0 256M 0 lvm └─lvm-tmp1 254:14 0 255M 0 part $ cat /sys/dev/block/254:13/discard_alignment 0 $ cat /sys/dev/block/254:13/queue/discard_granularity 134217728 $ cat /sys/dev/block/254:13/queue/discard_max_bytes 17179869184 $ cat /sys/dev/block/254:13/queue/discard_max_hw_bytes 0 $ cat /sys/dev/block/254:13/queue/discard_zeroes_data 0 $ cat /sys/dev/block/254:14/discard_alignment 133169152 $ cat /sys/dev/block/254:14/queue/discard_granularity 134217728 $ cat /sys/dev/block/254:14/queue/discard_max_bytes 17179869184 $ cat /sys/dev/block/254:14/queue/discard_max_hw_bytes 0 $ cat /sys/dev/block/254:14/queue/discard_zeroes_data 0 If this is given to QEMU using virtio-scsi: -device virtio-scsi-pci,id=scsi1 \ -drive driver=raw,node-name=hdb,file=/dev/lvm/tmp,if=none,discard=unmap,id=hd2 \ -device scsi-hd,drive=hd2,bootindex=1 \ Then incorrect values are given: $ lsblk ... sdb 8:16 0 256M 0 disk └─sdb1 8:17 0 255M 0 part /mnt $ cat /sys/dev/block/8:16/discard_alignment 0 $ cat /sys/dev/block/8:16/queue/discard_granularity 4096 $ cat /sys/dev/block/8:16/queue/discard_max_bytes 1073741824 $ cat /sys/dev/block/8:16/queue/discard_max_hw_bytes 1073741824 $ cat /sys/dev/block/8:16/queue/discard_zeroes_data 0 $ cat /sys/dev/block/8:17/discard_alignment 133169152 And, there isn't even a /sys/dev/block/8:17/queue direcotry. To manage notifications about this bug go to: https://bugs.launchpad.net/qemu/+bug/1811543/+subscriptions
[Qemu-devel] Various GTK window display problems - sorting bugs from features
Hello List, After filing a Debian bug, the package maintainer gave a hint, that this might be more relevant here: There seems to exist a (minor) usability issue with GTK frontend and detaching a window, that seems to lead to full loss of keyboard input to the guest, in the end. With the first issue (usability), it is not clear if this is intended behaviour. Maybe someone can give a short opinion to finish this one as "little unexpected feature". See also [0]. When running a Qemu machine with "-display gtk" then selecting Menu-bar -> View -> Detach tab and Menu-bar -> View -> Fullscreen will then display the menu bar into fullscreen mode, not the virtual machine window. As it is impossible to bring the detached window to fullscreen mode, it has to be reattached to the main windows again, This can be done by closing the windows. Everything looks nice afterwards but keybord events are not forwarded to the guest any more (mouse is still working). There is no difference in behaviour when GTK window is scaling the guest screen, guest display is in full screen mode or not, input is grabbed. Using the qemu monitor (with "-monitor stdio") still transports the keystrokes, so the guest seems to be working still. (qemu) sendkey x (qemu) sendkey ret Can anyone reproduce this behaviour? See also [1]. Background: the initial reason for exploring all window/fullscreen variants was, that the mouse input stopped working with the sdl -> gtk switch. The Debian package maintainer seems very responsive and commited, so we are still sorting this one out in [2]. Thanks for any feedback, opinions, hd [0] https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=919056 [1] https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=919116 [2] https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=919057
[Qemu-devel] [Bug 1811533] [NEW] Unstable Win10 guest with qemu 3.1 + huge pages + hv_stimer
Public bug reported: Host: Gentoo linux x86_64, kernel 4.20.1 Qemu 3.1.0 CPU: Intel i7 6850K Chipset: X99 Guest: Windows 10 Pro 64bit (1809) Machine type: pc-q35_3.1 Hyper-V enlightenments: hv_stimer,hv_reenlightenment,hv_frequencies,hv_vapic,hv_reset,hv_synic,hv_runtime,hv_vpindex,hv_time,hv_relaxed,hv_spinlocks=0x1fff Memory: 16GB backed by 2MB huge pages Issue: Once guest is started, log gets flooded with: qemu-system-x86_64: vhost_region_add_section: Overlapping but not coherent sections at 103000 or qemu-system-x86_64: vhost_region_add_section:Section rounded to 0 prior to previous 1f000 (line endings change) and as time goes guest loses network access (virtio-net-pci) and general performance diminishes to extent of freezing applications. Observations: 1) problem disappears when hv_stimer is removed 2) problem disappears when memory backing with huge pages is disabled 3) problem disappears when machine type is downgraded to pc-q35_3.0 ** Affects: qemu Importance: Undecided Status: New -- You received this bug notification because you are a member of qemu- devel-ml, which is subscribed to QEMU. https://bugs.launchpad.net/bugs/1811533 Title: Unstable Win10 guest with qemu 3.1 + huge pages + hv_stimer Status in QEMU: New Bug description: Host: Gentoo linux x86_64, kernel 4.20.1 Qemu 3.1.0 CPU: Intel i7 6850K Chipset: X99 Guest: Windows 10 Pro 64bit (1809) Machine type: pc-q35_3.1 Hyper-V enlightenments: hv_stimer,hv_reenlightenment,hv_frequencies,hv_vapic,hv_reset,hv_synic,hv_runtime,hv_vpindex,hv_time,hv_relaxed,hv_spinlocks=0x1fff Memory: 16GB backed by 2MB huge pages Issue: Once guest is started, log gets flooded with: qemu-system-x86_64: vhost_region_add_section: Overlapping but not coherent sections at 103000 or qemu-system-x86_64: vhost_region_add_section:Section rounded to 0 prior to previous 1f000 (line endings change) and as time goes guest loses network access (virtio-net-pci) and general performance diminishes to extent of freezing applications. Observations: 1) problem disappears when hv_stimer is removed 2) problem disappears when memory backing with huge pages is disabled 3) problem disappears when machine type is downgraded to pc-q35_3.0 To manage notifications about this bug go to: https://bugs.launchpad.net/qemu/+bug/1811533/+subscriptions
Re: [Qemu-devel] [PATCH 2/4] linuxboot_dma: move common functions in a new header
On Fri, Jan 11, 2019 at 6:55 PM Eric Blake wrote: > > On 1/11/19 11:48 AM, Michael S. Tsirkin wrote: > > >>> > diff --git a/pc-bios/optionrom/optrom.h b/pc-bios/optionrom/optrom.h > new file mode 100644 > index 00..36f43b43fd > --- /dev/null > +++ b/pc-bios/optionrom/optrom.h > > +#include "../../include/standard-headers/linux/qemu_fw_cfg.h" > >>> > >>> This depends on , please include it first. > >> > >> Sure. > >> > >> > >> Thanks, > >> Stefano > > > > Better to just pull in qemu/osdep.h > > Except that qemu/osdep.h should already have been pulled in by whatever > .c file is including this header. We specifically document that .h files > shouldn't need to include osdep.h (and in turn, anything that osdep.h > already pulls in, like ). Since I can't include qemu/osdep.h because this header file is used by the option roms (bare metal), do you think is better to include stdint.h in optrom.h or in the .c files that use it? As Stefan pointed out, I'm including qemu_fw_cfg.h in optrom.h and it depends on stdint.h. In this case, what is the best approach? Thanks, Stefano
Re: [Qemu-devel] [PATCH 2/4] linuxboot_dma: move common functions in a new header
On Fri, Jan 11, 2019 at 6:48 PM Michael S. Tsirkin wrote: > > > > +#include "../../include/standard-headers/linux/qemu_fw_cfg.h" > > > > > > This depends on , please include it first. > > Better to just pull in qemu/osdep.h This header is used in the option roms that are bare-metal, so IMHO I can't include qemu/osdep.h here or in the option roms. Thanks, Stefano
[Qemu-devel] [PATCH v3 00/19] nbd: add qemu-nbd --list
I got tired of debugging whether a server was advertising the correct things during negotiation by inspecting the trace logs of qemu-io as client - not to mention that without SOME sort of client tracing particular commands, we can't easily regression test the server for correct behavior. The final straw was at KVM Forum, when Nir asked me to make sure there was a way to easily determine if an NBD server is exposing what we really want (and fixing x-dirty-bitmap to behave saner fell out as a result of answering that question). I note that upstream NBD has 'nbd-client -l $host' for querying just export names (with no quoting, so you have to know that a blank line means the default export), but it wasn't powerful enough, so I implemented 'qemu-nbd -L' to document everything. Upstream NBD has separate 'nbd-client' and 'nbd-server' binaries, while we only have 'qemu-nbd' (which is normally just a server, but 'qemu-nbd -c' also operates a second thread as a client). Our other uses of qemu as NBD client are for consuming a block device (as in qemu-io, qemu-img, or a drive to qemu) - but those binaries are less suited to something so specific to the NBD protocol. Bonus: As a result of my work on this series, nbdkit now supports NBD_OPT_INFO (my interoperability testing between server implementations has been paying off, both at fixing server bugs, and at making this code more reliable across difference in valid servers). Also available at: https://repo.or.cz/qemu/ericb.git qemu-nbd-list-v2 Based-on: <20181221093529.23855-1-js...@redhat.com> [jsnow: 0/11 bitmaps: remove x- prefix from QMP api] Based-on: <2019063519.11457-1-phi...@redhat.com> [philmd: qemu-nbd: Rename 'exp' variable clashing with math::exp() symbol] Based-on: <2019094720.15671-1-ebl...@redhat.com> [eblake: 0/8 Promote x-nbd-server-add-bitmap to stable] Since v2: - Several patches merged already - 3 new patches based on audit of off_t vs. strtol - rebase patches on top of other changes, such as qemu-nbd --bitmap - address various review comments [Vladimir, Rich] - drop patch 12/22 001/19:[0020] [FC] 'maint: Allow for EXAMPLES in texi2pod' 002/19:[0030] [FC] 'qemu-nbd: Enhance man page' 003/19:[down] 'qemu-nbd: Sanity check partition bounds' 004/19:[down] 'nbd/server: Hoist length check to qemp_nbd_server_add' 005/19:[down] 'nbd/server: Favor [u]int64_t over off_t' 006/19:[0007] [FC] 'qemu-nbd: Avoid strtol open-coding' 007/19:[0016] [FC] 'nbd/client: Refactor nbd_receive_list()' 008/19:[] [--] 'nbd/client: Move export name into NBDExportInfo' 009/19:[] [--] 'nbd/client: Change signature of nbd_negotiate_simple_meta_context()' 010/19:[0007] [FC] 'nbd/client: Split out nbd_send_one_meta_context()' 011/19:[0048] [FC] 'nbd/client: Split out nbd_receive_one_meta_context()' 012/19:[] [--] 'nbd/client: Refactor return of nbd_receive_negotiate()' 013/19:[] [-C] 'nbd/client: Split handshake into two functions' 014/19:[] [--] 'nbd/client: Pull out oldstyle size determination' 015/19:[0008] [FC] 'nbd/client: Add nbd_receive_export_list()' 016/19:[] [-C] 'nbd/client: Add meta contexts to nbd_receive_export_list()' 017/19:[0015] [FC] 'qemu-nbd: Add --list option' 018/19:[] [--] 'nbd/client: Work around 3.0 bug for listing meta contexts' 019/19:[0002] [FC] 'iotests: Enhance 223, 233 to cover 'qemu-nbd --list'' Eric Blake (19): maint: Allow for EXAMPLES in texi2pod qemu-nbd: Enhance man page qemu-nbd: Sanity check partition bounds nbd/server: Hoist length check to qemp_nbd_server_add nbd/server: Favor [u]int64_t over off_t qemu-nbd: Avoid strtol open-coding nbd/client: Refactor nbd_receive_list() nbd/client: Move export name into NBDExportInfo nbd/client: Change signature of nbd_negotiate_simple_meta_context() nbd/client: Split out nbd_send_one_meta_context() nbd/client: Split out nbd_receive_one_meta_context() nbd/client: Refactor return of nbd_receive_negotiate() nbd/client: Split handshake into two functions nbd/client: Pull out oldstyle size determination nbd/client: Add nbd_receive_export_list() nbd/client: Add meta contexts to nbd_receive_export_list() qemu-nbd: Add --list option nbd/client: Work around 3.0 bug for listing meta contexts iotests: Enhance 223, 233 to cover 'qemu-nbd --list' qemu-nbd.texi | 114 -- Makefile | 2 + include/block/nbd.h| 31 +- block/nbd-client.c | 9 +- blockdev-nbd.c | 10 +- nbd/client.c | 756 ++--- nbd/server.c | 23 +- qemu-nbd.c | 221 --- nbd/trace-events | 11 +- scripts/texi2pod.pl| 2 +- tests/qemu-iotests/223 | 2 + tests/qemu-iotests/223.out | 20 + tests/qemu-iotests/233 | 19 +- tests/qemu-iotests/233.out | 15 + 14 files changed, 910 insertions(+), 325 deletions(-) -- 2.20.1
[Qemu-devel] [PATCH v3 01/19] maint: Allow for EXAMPLES in texi2pod
The next commit will add an EXAMPLES section to qemu-nbd.8; for that to work, we need to recognize EXAMPLES in texi2pod. We also need to add a dependency from all man pages against the generator script, since a change to the generator may cause the resulting man page to differ. Signed-off-by: Eric Blake --- v3: add generic dependency for all man pages in $(DOCS) instead of per-line editing [Vladimir] --- Makefile| 2 ++ scripts/texi2pod.pl | 2 +- 2 files changed, 3 insertions(+), 1 deletion(-) diff --git a/Makefile b/Makefile index a9ac16d94e8..e2d3ace190a 100644 --- a/Makefile +++ b/Makefile @@ -857,6 +857,8 @@ docs/interop/qemu-qmp-ref.dvi docs/interop/qemu-qmp-ref.html \ docs/interop/qemu-qmp-ref.txt docs/interop/qemu-qmp-ref.7: \ docs/interop/qemu-qmp-ref.texi docs/interop/qemu-qmp-qapi.texi +$(filter %.1 %.7 %.8,$(DOCS)): scripts/texi2pod.pl + # Reports/Analysis %/coverage-report.html: diff --git a/scripts/texi2pod.pl b/scripts/texi2pod.pl index 39ce584a322..839b7917cf7 100755 --- a/scripts/texi2pod.pl +++ b/scripts/texi2pod.pl @@ -398,7 +398,7 @@ $sects{NAME} = "$fn \- $tl\n"; $sects{FOOTNOTES} .= "=back\n" if exists $sects{FOOTNOTES}; for $sect (qw(NAME SYNOPSIS DESCRIPTION OPTIONS ENVIRONMENT FILES - BUGS NOTES FOOTNOTES SEEALSO AUTHOR COPYRIGHT)) { + BUGS NOTES FOOTNOTES EXAMPLES SEEALSO AUTHOR COPYRIGHT)) { if(exists $sects{$sect}) { $head = $sect; $head =~ s/SEEALSO/SEE ALSO/; -- 2.20.1
[Qemu-devel] [PATCH v3 11/19] nbd/client: Split out nbd_receive_one_meta_context()
Extract portions of nbd_negotiate_simple_meta_context() to a new function nbd_receive_one_meta_context() that copies the pattern of nbd_receive_list() for performing the argument validation of one reply. The error message when the server replies with more than one context changes slightly, but that shouldn't happen in the common case. Signed-off-by: Eric Blake Message-Id: <20181215135324.152629-15-ebl...@redhat.com> --- v3: rebase, without changing into a loop --- nbd/client.c | 148 +-- nbd/trace-events | 2 +- 2 files changed, 92 insertions(+), 58 deletions(-) diff --git a/nbd/client.c b/nbd/client.c index 3c716be2719..22505199d3b 100644 --- a/nbd/client.c +++ b/nbd/client.c @@ -672,7 +672,86 @@ static int nbd_send_one_meta_context(QIOChannel *ioc, return ret; } -/* nbd_negotiate_simple_meta_context: +/* + * nbd_receive_one_meta_context: + * Called in a loop to receive and trace one set/list meta context reply. + * Pass non-NULL @name or @id to collect results back to the caller, which + * must eventually call g_free(). + * return 1 if name is set and iteration must continue, + *0 if iteration is complete (including if option is unsupported), + *-1 with errp set for any error + */ +static int nbd_receive_one_meta_context(QIOChannel *ioc, +uint32_t opt, +char **name, +uint32_t *id, +Error **errp) +{ +int ret; +NBDOptionReply reply; +char *local_name = NULL; +uint32_t local_id; + +if (nbd_receive_option_reply(ioc, opt, , errp) < 0) { +return -1; +} + +ret = nbd_handle_reply_err(ioc, , errp); +if (ret <= 0) { +return ret; +} + +if (reply.type == NBD_REP_ACK) { +if (reply.length != 0) { +error_setg(errp, "Unexpected length to ACK response"); +nbd_send_opt_abort(ioc); +return -1; +} +return 0; +} else if (reply.type != NBD_REP_META_CONTEXT) { +error_setg(errp, "Unexpected reply type %u (%s), expected %u (%s)", + reply.type, nbd_rep_lookup(reply.type), + NBD_REP_META_CONTEXT, nbd_rep_lookup(NBD_REP_META_CONTEXT)); +nbd_send_opt_abort(ioc); +return -1; +} + +if (reply.length <= sizeof(local_id) || +reply.length > NBD_MAX_BUFFER_SIZE) { +error_setg(errp, "Failed to negotiate meta context, server " + "answered with unexpected length %" PRIu32, + reply.length); +nbd_send_opt_abort(ioc); +return -1; +} + +if (nbd_read(ioc, _id, sizeof(local_id), errp) < 0) { +return -1; +} +local_id = be32_to_cpu(local_id); + +reply.length -= sizeof(local_id); +local_name = g_malloc(reply.length + 1); +if (nbd_read(ioc, local_name, reply.length, errp) < 0) { +g_free(local_name); +return -1; +} +local_name[reply.length] = '\0'; +trace_nbd_opt_meta_reply(nbd_opt_lookup(opt), local_name, local_id); + +if (name) { +*name = local_name; +} else { +g_free(local_name); +} +if (id) { +*id = local_id; +} +return 1; +} + +/* + * nbd_negotiate_simple_meta_context: * Request the server to set the meta context for export @info->name * using @info->x_dirty_bitmap with a fallback to "base:allocation", * setting @info->context_id to the resulting id. Fail if the server @@ -693,50 +772,21 @@ static int nbd_negotiate_simple_meta_context(QIOChannel *ioc, * function should lose the term _simple. */ int ret; -NBDOptionReply reply; const char *context = info->x_dirty_bitmap ?: "base:allocation"; bool received = false; +char *name = NULL; if (nbd_send_one_meta_context(ioc, NBD_OPT_SET_META_CONTEXT, info->name, context, errp) < 0) { return -1; } -if (nbd_receive_option_reply(ioc, NBD_OPT_SET_META_CONTEXT, , - errp) < 0) -{ +ret = nbd_receive_one_meta_context(ioc, NBD_OPT_SET_META_CONTEXT, + , >context_id, errp); +if (ret < 0) { return -1; } - -ret = nbd_handle_reply_err(ioc, , errp); -if (ret <= 0) { -return ret; -} - -if (reply.type == NBD_REP_META_CONTEXT) { -char *name; - -if (reply.length != sizeof(info->context_id) + strlen(context)) { -error_setg(errp, "Failed to negotiate meta context '%s', server " - "answered with unexpected length %" PRIu32, context, - reply.length); -nbd_send_opt_abort(ioc); -return -1; -} - -if (nbd_read(ioc, >context_id, sizeof(info->context_id), - errp) < 0) { -return
[Qemu-devel] [PATCH v3 04/19] nbd/server: Hoist length check to qemp_nbd_server_add
We only had two callers to nbd_export_new; qemu-nbd.c always passed a valid offset/length pair (because it already checked the file length, to ensure that offset was in bounds), while blockdev-nbd always passed 0/-1. Then nbd_export_new reduces the size to a multiple of BDRV_SECTOR_SIZE (can only happen when offset is not sector-aligned, since bdrv_getlength() currently rounds up), which can result in offset being greater than the enforced length, but that's not fatal (the server rejects client requests that exceed the advertised length). However, I'm finding it easier to work with the code if we are consistent on having both callers pass in a valid length, and just assert that things are sane in nbd_export_new. Signed-off-by: Eric Blake --- v3: new patch --- blockdev-nbd.c | 10 +- nbd/server.c | 9 ++--- 2 files changed, 11 insertions(+), 8 deletions(-) diff --git a/blockdev-nbd.c b/blockdev-nbd.c index c76d5416b90..d73ac1b026a 100644 --- a/blockdev-nbd.c +++ b/blockdev-nbd.c @@ -146,6 +146,7 @@ void qmp_nbd_server_add(const char *device, bool has_name, const char *name, BlockDriverState *bs = NULL; BlockBackend *on_eject_blk; NBDExport *exp; +int64_t len; if (!nbd_server) { error_setg(errp, "NBD server not running"); @@ -168,6 +169,13 @@ void qmp_nbd_server_add(const char *device, bool has_name, const char *name, return; } +len = bdrv_getlength(bs); +if (len < 0) { +error_setg_errno(errp, -len, + "Failed to determine the NBD export's length"); +return; +} + if (!has_writable) { writable = false; } @@ -175,7 +183,7 @@ void qmp_nbd_server_add(const char *device, bool has_name, const char *name, writable = false; } -exp = nbd_export_new(bs, 0, -1, name, NULL, bitmap, +exp = nbd_export_new(bs, 0, len, name, NULL, bitmap, writable ? 0 : NBD_FLAG_READ_ONLY, NULL, false, on_eject_blk, errp); if (!exp) { diff --git a/nbd/server.c b/nbd/server.c index e8c56607eff..c9937ccdc2a 100644 --- a/nbd/server.c +++ b/nbd/server.c @@ -1499,13 +1499,8 @@ NBDExport *nbd_export_new(BlockDriverState *bs, off_t dev_offset, off_t size, exp->name = g_strdup(name); exp->description = g_strdup(description); exp->nbdflags = nbdflags; -exp->size = size < 0 ? blk_getlength(blk) : size; -if (exp->size < 0) { -error_setg_errno(errp, -exp->size, - "Failed to determine the NBD export's length"); -goto fail; -} -exp->size -= exp->size % BDRV_SECTOR_SIZE; +assert(dev_offset <= size); +exp->size = QEMU_ALIGN_DOWN(size, BDRV_SECTOR_SIZE); if (bitmap) { BdrvDirtyBitmap *bm = NULL; -- 2.20.1
[Qemu-devel] [PATCH v3 03/19] qemu-nbd: Sanity check partition bounds
When the user requests a partition, we were using data read from the disk as disk offsets without a bounds check. We got lucky that even when computed offsets are out-of-bounds, blk_pread() will gracefully catch the error later (so I don't think a malicious image can crash or exploit qemu-nbd, and am not treating this as a security flaw), but it's better to flag the problem up front than to risk permanent EIO death of the block device down the road. Also, note that the partition code blindly overwrites any offset passed in by the user; so make the -o/-P combo an error for less confusion. This can be tested with nbdkit: $ echo hi > file $ nbdkit -fv --filter=truncate partitioning file truncate=64k Pre-patch: $ qemu-nbd -p 10810 -P 1 -f raw nbd://localhost:10809 & $ qemu-io -f raw nbd://localhost:10810 qemu-io> r -v 0 1 Disconnect client, due to: Failed to send reply: reading from file failed: Input/output error Connection closed read failed: Input/output error qemu-io> q [1]+ Doneqemu-nbd -p 10810 -P 1 -f raw nbd://localhost:10809 Post-patch: $ qemu-nbd -p 10810 -P 1 -f raw nbd://localhost:10809 qemu-nbd: Discovered partition 1 at offset 1048576 size 512, but size exceeds file length 65536 Signed-off-by: Eric Blake --- v3: new patch --- qemu-nbd.c | 18 +- 1 file changed, 17 insertions(+), 1 deletion(-) diff --git a/qemu-nbd.c b/qemu-nbd.c index 51b55f2e066..ff4adb9b3eb 100644 --- a/qemu-nbd.c +++ b/qemu-nbd.c @@ -1013,12 +1013,28 @@ int main(int argc, char **argv) fd_size -= dev_offset; if (partition != -1) { -ret = find_partition(blk, partition, _offset, _size); +off_t limit; + +if (dev_offset) { +error_report("Cannot request partition and offset together"); +exit(EXIT_FAILURE); +} +ret = find_partition(blk, partition, _offset, ); if (ret < 0) { error_report("Could not find partition %d: %s", partition, strerror(-ret)); exit(EXIT_FAILURE); } +/* partition limits are (32-bit << 9); can't overflow 64 bits */ +assert(dev_offset >= 0 && dev_offset + limit >= dev_offset); +if (dev_offset + limit > fd_size) { +error_report("Discovered partition %d at offset %lld size %lld, " + "but size exceeds file length %lld", partition, + (long long int) dev_offset, (long long int) limit, + (long long int) fd_size); +exit(EXIT_FAILURE); +} +fd_size = limit; } export = nbd_export_new(bs, dev_offset, fd_size, export_name, -- 2.20.1
[Qemu-devel] [PATCH v3 08/19] nbd/client: Move export name into NBDExportInfo
Refactor the 'name' parameter of nbd_receive_negotiate() from being a separate parameter into being part of the in-out 'info'. This also spills over to a simplification of nbd_opt_go(). The main driver for this refactoring is that an upcoming patch would like to add support to qemu-nbd to list information about all exports available on a server, where the name(s) will be provided by the server instead of the client. But another benefit is that we can now allow the client to explicitly specify the empty export name "" even when connecting to an oldstyle server (even if qemu is no longer such a server after commit 7f7dfe2a). Signed-off-by: Eric Blake Reviewed-by: Richard W.M. Jones Reviewed-by: Vladimir Sementsov-Ogievskiy Message-Id: <20181215135324.152629-11-ebl...@redhat.com> --- include/block/nbd.h | 8 block/nbd-client.c | 5 +++-- nbd/client.c| 39 ++- qemu-nbd.c | 6 -- nbd/trace-events| 2 +- 5 files changed, 30 insertions(+), 30 deletions(-) diff --git a/include/block/nbd.h b/include/block/nbd.h index 0f252829376..6d76d3dc221 100644 --- a/include/block/nbd.h +++ b/include/block/nbd.h @@ -262,6 +262,7 @@ struct NBDExportInfo { /* Set by client before nbd_receive_negotiate() */ bool request_sizes; char *x_dirty_bitmap; +char *name; /* must be non-NULL */ /* In-out fields, set by client before nbd_receive_negotiate() and * updated by server results during nbd_receive_negotiate() */ @@ -279,10 +280,9 @@ struct NBDExportInfo { }; typedef struct NBDExportInfo NBDExportInfo; -int nbd_receive_negotiate(QIOChannel *ioc, const char *name, - QCryptoTLSCreds *tlscreds, const char *hostname, - QIOChannel **outioc, NBDExportInfo *info, - Error **errp); +int nbd_receive_negotiate(QIOChannel *ioc, QCryptoTLSCreds *tlscreds, + const char *hostname, QIOChannel **outioc, + NBDExportInfo *info, Error **errp); int nbd_init(int fd, QIOChannelSocket *sioc, NBDExportInfo *info, Error **errp); int nbd_send_request(QIOChannel *ioc, NBDRequest *request); diff --git a/block/nbd-client.c b/block/nbd-client.c index ef320759716..3309376bc16 100644 --- a/block/nbd-client.c +++ b/block/nbd-client.c @@ -999,10 +999,11 @@ int nbd_client_init(BlockDriverState *bs, client->info.structured_reply = true; client->info.base_allocation = true; client->info.x_dirty_bitmap = g_strdup(x_dirty_bitmap); -ret = nbd_receive_negotiate(QIO_CHANNEL(sioc), export, -tlscreds, hostname, +client->info.name = g_strdup(export ?: ""); +ret = nbd_receive_negotiate(QIO_CHANNEL(sioc), tlscreds, hostname, >ioc, >info, errp); g_free(client->info.x_dirty_bitmap); +g_free(client->info.name); if (ret < 0) { logout("Failed to negotiate with the NBD server\n"); return ret; diff --git a/nbd/client.c b/nbd/client.c index fd4ba8dec37..8227e69478a 100644 --- a/nbd/client.c +++ b/nbd/client.c @@ -330,15 +330,14 @@ static int nbd_receive_list(QIOChannel *ioc, char **name, char **description, } -/* Returns -1 if NBD_OPT_GO proves the export @wantname cannot be +/* Returns -1 if NBD_OPT_GO proves the export @info->name cannot be * used, 0 if NBD_OPT_GO is unsupported (fall back to NBD_OPT_LIST and * NBD_OPT_EXPORT_NAME in that case), and > 0 if the export is good to - * go (with @info populated). */ -static int nbd_opt_go(QIOChannel *ioc, const char *wantname, - NBDExportInfo *info, Error **errp) + * go (with the rest of @info populated). */ +static int nbd_opt_go(QIOChannel *ioc, NBDExportInfo *info, Error **errp) { NBDOptionReply reply; -uint32_t len = strlen(wantname); +uint32_t len = strlen(info->name); uint16_t type; int error; char *buf; @@ -348,10 +347,10 @@ static int nbd_opt_go(QIOChannel *ioc, const char *wantname, * flags still 0 is a witness of a broken server. */ info->flags = 0; -trace_nbd_opt_go_start(wantname); +trace_nbd_opt_go_start(info->name); buf = g_malloc(4 + len + 2 + 2 * info->request_sizes + 1); stl_be_p(buf, len); -memcpy(buf + 4, wantname, len); +memcpy(buf + 4, info->name, len); /* At most one request, everything else up to server */ stw_be_p(buf + 4 + len, info->request_sizes); if (info->request_sizes) { @@ -753,10 +752,9 @@ static int nbd_negotiate_simple_meta_context(QIOChannel *ioc, return 0; } -int nbd_receive_negotiate(QIOChannel *ioc, const char *name, - QCryptoTLSCreds *tlscreds, const char *hostname, - QIOChannel **outioc, NBDExportInfo *info, - Error **errp) +int nbd_receive_negotiate(QIOChannel *ioc, QCryptoTLSCreds *tlscreds, + const char
[Qemu-devel] [PATCH v3 14/19] nbd/client: Pull out oldstyle size determination
Another refactoring creating nbd_negotiate_finish_oldstyle() for further reuse during 'qemu-nbd --list'. Signed-off-by: Eric Blake Message-Id: <20181215135324.152629-18-ebl...@redhat.com> Reviewed-by: Richard W.M. Jones --- nbd/client.c | 49 - 1 file changed, 32 insertions(+), 17 deletions(-) diff --git a/nbd/client.c b/nbd/client.c index 5053433ea5e..620fbb5ef01 100644 --- a/nbd/client.c +++ b/nbd/client.c @@ -818,7 +818,7 @@ static int nbd_negotiate_simple_meta_context(QIOChannel *ioc, * Start the handshake to the server. After a positive return, the server * is ready to accept additional NBD_OPT requests. * Returns: negative errno: failure talking to server - * 0: server is oldstyle, client must still parse export size + * 0: server is oldstyle, must call nbd_negotiate_finish_oldstyle * 1: server is newstyle, but can only accept EXPORT_NAME * 2: server is newstyle, but lacks structured replies * 3: server is newstyle and set up for structured replies @@ -923,6 +923,36 @@ static int nbd_start_negotiate(QIOChannel *ioc, QCryptoTLSCreds *tlscreds, } } +/* + * nbd_negotiate_finish_oldstyle: + * Populate @info with the size and export flags from an oldstyle server, + * but does not consume 124 bytes of reserved zero padding. + * Returns 0 on success, -1 with @errp set on failure + */ +static int nbd_negotiate_finish_oldstyle(QIOChannel *ioc, NBDExportInfo *info, + Error **errp) +{ +uint32_t oldflags; + +if (nbd_read(ioc, >size, sizeof(info->size), errp) < 0) { +error_prepend(errp, "Failed to read export length: "); +return -EINVAL; +} +info->size = be64_to_cpu(info->size); + +if (nbd_read(ioc, , sizeof(oldflags), errp) < 0) { +error_prepend(errp, "Failed to read export flags: "); +return -EINVAL; +} +oldflags = be32_to_cpu(oldflags); +if (oldflags & ~0x) { +error_setg(errp, "Unexpected export flags %0x" PRIx32, oldflags); +return -EINVAL; +} +info->flags = oldflags; +return 0; +} + /* * nbd_receive_negotiate: * Connect to server, complete negotiation, and move into transmission phase. @@ -936,7 +966,6 @@ int nbd_receive_negotiate(QIOChannel *ioc, QCryptoTLSCreds *tlscreds, int result; bool zeroes = true; bool base_allocation = info->base_allocation; -uint32_t oldflags; assert(info->name); trace_nbd_receive_negotiate_name(info->name); @@ -1009,23 +1038,9 @@ int nbd_receive_negotiate(QIOChannel *ioc, QCryptoTLSCreds *tlscreds, error_setg(errp, "Server does not support non-empty export names"); return -EINVAL; } - -if (nbd_read(ioc, >size, sizeof(info->size), errp) < 0) { -error_prepend(errp, "Failed to read export length: "); +if (nbd_negotiate_finish_oldstyle(ioc, info, errp) < 0) { return -EINVAL; } -info->size = be64_to_cpu(info->size); - -if (nbd_read(ioc, , sizeof(oldflags), errp) < 0) { -error_prepend(errp, "Failed to read export flags: "); -return -EINVAL; -} -oldflags = be32_to_cpu(oldflags); -if (oldflags & ~0x) { -error_setg(errp, "Unexpected export flags %0x" PRIx32, oldflags); -return -EINVAL; -} -info->flags = oldflags; break; default: return result; -- 2.20.1
[Qemu-devel] [PATCH v3 02/19] qemu-nbd: Enhance man page
Document some useful qemu-nbd command lines. Mention some restrictions on particular options, like -p being only for MBR images, or -c/-d being Linux-only. Update some text given the recent change to no longer serve oldstyle protocol (missed in commit 7f7dfe2a). Also, consistently use trailing '.' in describing options. Signed-off-by: Eric Blake --- v3: wording improvements, use -t in more examples [Rich] --- qemu-nbd.texi | 91 --- 1 file changed, 72 insertions(+), 19 deletions(-) diff --git a/qemu-nbd.texi b/qemu-nbd.texi index 96b1546006a..3f22559beb4 100644 --- a/qemu-nbd.texi +++ b/qemu-nbd.texi @@ -10,11 +10,17 @@ Export a QEMU disk image using the NBD protocol. +Other uses: +@itemize +@item +Bind a /dev/nbdX block device to a QEMU server (on Linux). +@end itemize + @c man end @c man begin OPTIONS @var{filename} is a disk image filename, or a set of block -driver options if @var{--image-opts} is specified. +driver options if @option{--image-opts} is specified. @var{dev} is an NBD device. @@ -27,24 +33,25 @@ supported. The common object types that it makes sense to define are the keys, and the @code{tls-creds} object, which is used to supply TLS credentials for the qemu-nbd server. @item -p, --port=@var{port} -The TCP port to listen on (default @samp{10809}) +The TCP port to listen on (default @samp{10809}). @item -o, --offset=@var{offset} -The offset into the image +The offset into the image. @item -b, --bind=@var{iface} -The interface to bind to (default @samp{0.0.0.0}) +The interface to bind to (default @samp{0.0.0.0}). @item -k, --socket=@var{path} -Use a unix socket with path @var{path} +Use a unix socket with path @var{path}. @item --image-opts Treat @var{filename} as a set of image options, instead of a plain filename. If this flag is specified, the @var{-f} flag should not be used, instead the '@code{format=}' option should be set. @item -f, --format=@var{fmt} Force the use of the block driver for format @var{fmt} instead of -auto-detecting +auto-detecting. @item -r, --read-only -Export the disk as read-only +Export the disk as read-only. @item -P, --partition=@var{num} -Only expose partition @var{num} +Only expose MBR partition @var{num}. Understands physical partitions +1-4 and logical partitions 5-8. @item -B, --bitmap=@var{name} If @var{filename} has a qcow2 persistent bitmap @var{name}, expose that bitmap via the ``qemu:dirty-bitmap:@var{name}'' context @@ -52,7 +59,7 @@ accessible through NBD_OPT_SET_META_CONTEXT. @item -s, --snapshot Use @var{filename} as an external snapshot, create a temporary file with backing_file=@var{filename}, redirect the write to -the temporary one +the temporary one. @item -l, --load-snapshot=@var{snapshot_param} Load an internal snapshot inside @var{filename} and export it as an read-only device, @var{snapshot_param} format is @@ -76,19 +83,20 @@ driver-specific optimized zero write commands. @var{detect-zeroes} is one of converts a zero write to an unmap operation and can only be used if @var{discard} is set to @samp{unmap}. The default is @samp{off}. @item -c, --connect=@var{dev} -Connect @var{filename} to NBD device @var{dev} +Connect @var{filename} to NBD device @var{dev} (Linux only). @item -d, --disconnect -Disconnect the device @var{dev} +Disconnect the device @var{dev} (Linux only). @item -e, --shared=@var{num} -Allow up to @var{num} clients to share the device (default @samp{1}) +Allow up to @var{num} clients to share the device (default +@samp{1}). Safe for readers, but for now, consistency is not +guaranteed between multiple writers. @item -t, --persistent -Don't exit on the last connection +Don't exit on the last connection. @item -x, --export-name=@var{name} -Set the NBD volume export name. This switches the server to use -the new style NBD protocol negotiation +Set the NBD volume export name (default of a zero-length string). @item -D, --description=@var{description} Set the NBD volume export description, as a human-readable -string. Requires the use of @option{-x} +string. @item --tls-creds=ID Enable mandatory TLS encryption for the server by setting the ID of the TLS credentials object previously created with the --object @@ -96,11 +104,11 @@ option. @item --fork Fork off the server process and exit the parent once the server is running. @item -v, --verbose -Display extra debugging information +Display extra debugging information. @item -h, --help -Display this help and exit +Display this help and exit. @item -V, --version -Display version information and exit +Display version information and exit. @item -T, --trace [[enable=]@var{pattern}][,events=@var{file}][,file=@var{file}] @findex --trace @include qemu-option-trace.texi @@ -108,6 +116,51 @@ Display version information and exit @c man end +@c man begin EXAMPLES +Start a server listening on port 10809 that exposes only the +guest-visible contents of a qcow2 file, with no TLS
[Qemu-devel] [PATCH v3 13/19] nbd/client: Split handshake into two functions
An upcoming patch will add the ability for qemu-nbd to list the services provided by an NBD server. Share the common code of the TLS handshake by splitting the initial exchange into a separate function, leaving only the export handling in the original function. Functionally, there should be no change in behavior in this patch, although some of the code motion may be difficult to follow due to indentation changes (view with 'git diff -w' for a smaller changeset). I considered an enum for the return code coordinating state between the two functions, but in the end just settled with ample comments. Signed-off-by: Eric Blake Reviewed-by: Richard W.M. Jones Reviewed-by: Vladimir Sementsov-Ogievskiy Message-Id: <20181215135324.152629-17-ebl...@redhat.com> --- nbd/client.c | 144 +++ nbd/trace-events | 2 +- 2 files changed, 95 insertions(+), 51 deletions(-) diff --git a/nbd/client.c b/nbd/client.c index fa931fd8e5d..5053433ea5e 100644 --- a/nbd/client.c +++ b/nbd/client.c @@ -813,21 +813,24 @@ static int nbd_negotiate_simple_meta_context(QIOChannel *ioc, return received; } -int nbd_receive_negotiate(QIOChannel *ioc, QCryptoTLSCreds *tlscreds, - const char *hostname, QIOChannel **outioc, - NBDExportInfo *info, Error **errp) +/* + * nbd_start_negotiate: + * Start the handshake to the server. After a positive return, the server + * is ready to accept additional NBD_OPT requests. + * Returns: negative errno: failure talking to server + * 0: server is oldstyle, client must still parse export size + * 1: server is newstyle, but can only accept EXPORT_NAME + * 2: server is newstyle, but lacks structured replies + * 3: server is newstyle and set up for structured replies + */ +static int nbd_start_negotiate(QIOChannel *ioc, QCryptoTLSCreds *tlscreds, + const char *hostname, QIOChannel **outioc, + bool structured_reply, bool *zeroes, + Error **errp) { uint64_t magic; -bool zeroes = true; -bool structured_reply = info->structured_reply; -bool base_allocation = info->base_allocation; -trace_nbd_receive_negotiate(tlscreds, hostname ? hostname : ""); - -assert(info->name); -trace_nbd_receive_negotiate_name(info->name); -info->structured_reply = false; -info->base_allocation = false; +trace_nbd_start_negotiate(tlscreds, hostname ? hostname : ""); if (outioc) { *outioc = NULL; @@ -872,7 +875,7 @@ int nbd_receive_negotiate(QIOChannel *ioc, QCryptoTLSCreds *tlscreds, clientflags |= NBD_FLAG_C_FIXED_NEWSTYLE; } if (globalflags & NBD_FLAG_NO_ZEROES) { -zeroes = false; +*zeroes = false; clientflags |= NBD_FLAG_C_NO_ZEROES; } /* client requested flags */ @@ -894,7 +897,7 @@ int nbd_receive_negotiate(QIOChannel *ioc, QCryptoTLSCreds *tlscreds, } } if (fixedNewStyle) { -int result; +int result = 0; if (structured_reply) { result = nbd_request_simple_option(ioc, @@ -903,39 +906,85 @@ int nbd_receive_negotiate(QIOChannel *ioc, QCryptoTLSCreds *tlscreds, if (result < 0) { return -EINVAL; } -info->structured_reply = result == 1; } +return 2 + result; +} else { +return 1; +} +} else if (magic == NBD_CLIENT_MAGIC) { +if (tlscreds) { +error_setg(errp, "Server does not support STARTTLS"); +return -EINVAL; +} +return 0; +} else { +error_setg(errp, "Bad server magic received: 0x%" PRIx64, magic); +return -EINVAL; +} +} -if (info->structured_reply && base_allocation) { -result = nbd_negotiate_simple_meta_context(ioc, info, errp); -if (result < 0) { -return -EINVAL; -} -info->base_allocation = result == 1; -} +/* + * nbd_receive_negotiate: + * Connect to server, complete negotiation, and move into transmission phase. + * Returns: negative errno: failure talking to server + * 0: server is connected + */ +int nbd_receive_negotiate(QIOChannel *ioc, QCryptoTLSCreds *tlscreds, + const char *hostname, QIOChannel **outioc, + NBDExportInfo *info, Error **errp) +{ +int result; +bool zeroes = true; +bool base_allocation = info->base_allocation; +uint32_t oldflags; + +assert(info->name); +trace_nbd_receive_negotiate_name(info->name); + +result = nbd_start_negotiate(ioc, tlscreds, hostname, outioc, + info->structured_reply, , errp); + +info->structured_reply =
[Qemu-devel] [PATCH v3 15/19] nbd/client: Add nbd_receive_export_list()
We want to be able to detect whether a given qemu NBD server is exposing the right export(s) and dirty bitmaps, at least for regression testing. We could use 'nbd-client -l' from the upstream NBD project to list exports, but it's annoying to rely on out-of-tree binaries; furthermore, nbd-client doesn't necessarily know about all of the qemu NBD extensions. Thus, we plan on adding a new mode to qemu-nbd that merely sniffs all possible information from the server during handshake phase, then disconnects and dumps the information. This patch adds the low-level client code for grabbing the list of exports. It benefits from the recent refactoring patches, as well as a minor tweak of changing nbd_opt_go() to nbd_opt_info_or_go(), in order to share as much code as possible when it comes to doing validation of server replies. The resulting information is stored in an array of NBDExportInfo which has been expanded to any description string, along with a convenience function for freeing the list. Note: a malicious server could exhaust memory of a client by feeding an unending loop of exports; perhaps we should place a limit on how many we are willing to receive. But note that a server could reasonably be serving an export for every file in a large directory, where an arbitrary limit in the client means we can't list anything from such a server; the same happens if we just run until the client fails to malloc() and thus dies by an abort(), where the limit is no longer arbitrary but determined by available memory. Since the client is already planning on being short-lived, it's hard to call this a denial of service attack that would starve off other uses, so it does not appear to be a security issue. Signed-off-by: Eric Blake Message-Id: <20181215135324.152629-19-ebl...@redhat.com> Reviewed-by: Richard W.M. Jones --- v3: mention security (non-)issue in commit message [Rich], formatting tweaks --- include/block/nbd.h | 15 - nbd/client.c| 144 +--- nbd/trace-events| 2 +- 3 files changed, 150 insertions(+), 11 deletions(-) diff --git a/include/block/nbd.h b/include/block/nbd.h index bdc7ec7195a..e9a442ce7e9 100644 --- a/include/block/nbd.h +++ b/include/block/nbd.h @@ -1,5 +1,5 @@ /* - * Copyright (C) 2016-2017 Red Hat, Inc. + * Copyright (C) 2016-2019 Red Hat, Inc. * Copyright (C) 2005 Anthony Liguori * * Network Block Device @@ -262,6 +262,9 @@ struct NBDExportInfo { /* Set by client before nbd_receive_negotiate() */ bool request_sizes; char *x_dirty_bitmap; + +/* Set by client before nbd_receive_negotiate(), or by server results + * during nbd_receive_export_list() */ char *name; /* must be non-NULL */ /* In-out fields, set by client before nbd_receive_negotiate() and @@ -269,7 +272,8 @@ struct NBDExportInfo { bool structured_reply; bool base_allocation; /* base:allocation context for NBD_CMD_BLOCK_STATUS */ -/* Set by server results during nbd_receive_negotiate() */ +/* Set by server results during nbd_receive_negotiate() and + * nbd_receive_export_list() */ uint64_t size; uint16_t flags; uint32_t min_block; @@ -277,12 +281,19 @@ struct NBDExportInfo { uint32_t max_block; uint32_t context_id; + +/* Set by server results during nbd_receive_export_list() */ +char *description; }; typedef struct NBDExportInfo NBDExportInfo; int nbd_receive_negotiate(QIOChannel *ioc, QCryptoTLSCreds *tlscreds, const char *hostname, QIOChannel **outioc, NBDExportInfo *info, Error **errp); +void nbd_free_export_list(NBDExportInfo *info, int count); +int nbd_receive_export_list(QIOChannel *ioc, QCryptoTLSCreds *tlscreds, +const char *hostname, NBDExportInfo **info, +Error **errp); int nbd_init(int fd, QIOChannelSocket *sioc, NBDExportInfo *info, Error **errp); int nbd_send_request(QIOChannel *ioc, NBDRequest *request); diff --git a/nbd/client.c b/nbd/client.c index 620fbb5ef01..a5a705a67f7 100644 --- a/nbd/client.c +++ b/nbd/client.c @@ -330,11 +330,14 @@ static int nbd_receive_list(QIOChannel *ioc, char **name, char **description, } -/* Returns -1 if NBD_OPT_GO proves the export @info->name cannot be +/* + * Returns -1 if NBD_OPT_GO proves the export @info->name cannot be * used, 0 if NBD_OPT_GO is unsupported (fall back to NBD_OPT_LIST and * NBD_OPT_EXPORT_NAME in that case), and > 0 if the export is good to - * go (with the rest of @info populated). */ -static int nbd_opt_go(QIOChannel *ioc, NBDExportInfo *info, Error **errp) + * go (with the rest of @info populated). + */ +static int nbd_opt_info_or_go(QIOChannel *ioc, uint32_t opt, + NBDExportInfo *info, Error **errp) { NBDOptionReply reply; uint32_t len = strlen(info->name); @@ -347,7 +350,8 @@ static int nbd_opt_go(QIOChannel *ioc,
[Qemu-devel] [PATCH v3 10/19] nbd/client: Split out nbd_send_one_meta_context()
Refactor nbd_negotiate_simple_meta_context() to pull out the code that can be reused to send a LIST request for 0 or 1 query. No semantic change. The old comment about 'sizeof(uint32_t)' being equivalent to '/* number of queries */' is no longer needed, now that we are computing 'sizeof(queries)' instead. Signed-off-by: Eric Blake Message-Id: <20181215135324.152629-14-ebl...@redhat.com> Reviewed-by: Richard W.M. Jones --- v3: Improve commit message [Rich], formatting tweak [checkpatch], rebase to dropped patch --- nbd/client.c | 67 +--- nbd/trace-events | 2 +- 2 files changed, 48 insertions(+), 21 deletions(-) diff --git a/nbd/client.c b/nbd/client.c index 77993890f04..3c716be2719 100644 --- a/nbd/client.c +++ b/nbd/client.c @@ -629,6 +629,49 @@ static QIOChannel *nbd_receive_starttls(QIOChannel *ioc, return QIO_CHANNEL(tioc); } +/* + * nbd_send_one_meta_context: + * Send 0 or 1 set/list meta context queries. + * Return 0 on success, -1 with errp set for any error + */ +static int nbd_send_one_meta_context(QIOChannel *ioc, + uint32_t opt, + const char *export, + const char *query, + Error **errp) +{ +int ret; +uint32_t export_len = strlen(export); +uint32_t queries = !!query; +uint32_t context_len = 0; +uint32_t data_len; +char *data; +char *p; + +data_len = sizeof(export_len) + export_len + sizeof(queries); +if (query) { +context_len = strlen(query); +data_len += sizeof(context_len) + context_len; +} else { +assert(opt == NBD_OPT_LIST_META_CONTEXT); +} +data = g_malloc(data_len); +p = data; + +trace_nbd_opt_meta_request(nbd_opt_lookup(opt), query ?: "(all)", export); +stl_be_p(p, export_len); +memcpy(p += sizeof(export_len), export, export_len); +stl_be_p(p += export_len, queries); +if (query) { +stl_be_p(p += sizeof(uint32_t), context_len); +memcpy(p += sizeof(context_len), query, context_len); +} + +ret = nbd_send_option_request(ioc, opt, data_len, data, errp); +g_free(data); +return ret; +} + /* nbd_negotiate_simple_meta_context: * Request the server to set the meta context for export @info->name * using @info->x_dirty_bitmap with a fallback to "base:allocation", @@ -653,26 +696,10 @@ static int nbd_negotiate_simple_meta_context(QIOChannel *ioc, NBDOptionReply reply; const char *context = info->x_dirty_bitmap ?: "base:allocation"; bool received = false; -uint32_t export_len = strlen(info->name); -uint32_t context_len = strlen(context); -uint32_t data_len = sizeof(export_len) + export_len + -sizeof(uint32_t) + /* number of queries */ -sizeof(context_len) + context_len; -char *data = g_malloc(data_len); -char *p = data; -trace_nbd_opt_meta_request(context, info->name); -stl_be_p(p, export_len); -memcpy(p += sizeof(export_len), info->name, export_len); -stl_be_p(p += export_len, 1); -stl_be_p(p += sizeof(uint32_t), context_len); -memcpy(p += sizeof(context_len), context, context_len); - -ret = nbd_send_option_request(ioc, NBD_OPT_SET_META_CONTEXT, data_len, data, - errp); -g_free(data); -if (ret < 0) { -return ret; +if (nbd_send_one_meta_context(ioc, NBD_OPT_SET_META_CONTEXT, + info->name, context, errp) < 0) { +return -1; } if (nbd_receive_option_reply(ioc, NBD_OPT_SET_META_CONTEXT, , @@ -689,7 +716,7 @@ static int nbd_negotiate_simple_meta_context(QIOChannel *ioc, if (reply.type == NBD_REP_META_CONTEXT) { char *name; -if (reply.length != sizeof(info->context_id) + context_len) { +if (reply.length != sizeof(info->context_id) + strlen(context)) { error_setg(errp, "Failed to negotiate meta context '%s', server " "answered with unexpected length %" PRIu32, context, reply.length); diff --git a/nbd/trace-events b/nbd/trace-events index c3966d2b653..59521e47a3d 100644 --- a/nbd/trace-events +++ b/nbd/trace-events @@ -12,7 +12,7 @@ nbd_receive_query_exports_start(const char *wantname) "Querying export list for nbd_receive_query_exports_success(const char *wantname) "Found desired export name '%s'" nbd_receive_starttls_new_client(void) "Setting up TLS" nbd_receive_starttls_tls_handshake(void) "Starting TLS handshake" -nbd_opt_meta_request(const char *context, const char *export) "Requesting to set meta context %s for export %s" +nbd_opt_meta_request(const char *optname, const char *context, const char *export) "Requesting %s %s for export %s" nbd_opt_meta_reply(const char *context, uint32_t id) "Received mapping of context %s to id %" PRIu32
[Qemu-devel] [PATCH v3 12/19] nbd/client: Refactor return of nbd_receive_negotiate()
The function could only ever return 0 or -EINVAL; make this clearer by dropping a useless 'fail:' label. Signed-off-by: Eric Blake Reviewed-by: Richard W.M. Jones Reviewed-by: Vladimir Sementsov-Ogievskiy Message-Id: <20181215135324.152629-16-ebl...@redhat.com> --- nbd/client.c | 51 +++ 1 file changed, 23 insertions(+), 28 deletions(-) diff --git a/nbd/client.c b/nbd/client.c index 22505199d3b..fa931fd8e5d 100644 --- a/nbd/client.c +++ b/nbd/client.c @@ -818,7 +818,6 @@ int nbd_receive_negotiate(QIOChannel *ioc, QCryptoTLSCreds *tlscreds, NBDExportInfo *info, Error **errp) { uint64_t magic; -int rc; bool zeroes = true; bool structured_reply = info->structured_reply; bool base_allocation = info->base_allocation; @@ -829,31 +828,30 @@ int nbd_receive_negotiate(QIOChannel *ioc, QCryptoTLSCreds *tlscreds, trace_nbd_receive_negotiate_name(info->name); info->structured_reply = false; info->base_allocation = false; -rc = -EINVAL; if (outioc) { *outioc = NULL; } if (tlscreds && !outioc) { error_setg(errp, "Output I/O channel required for TLS"); -goto fail; +return -EINVAL; } if (nbd_read(ioc, , sizeof(magic), errp) < 0) { error_prepend(errp, "Failed to read initial magic: "); -goto fail; +return -EINVAL; } magic = be64_to_cpu(magic); trace_nbd_receive_negotiate_magic(magic); if (magic != NBD_INIT_MAGIC) { error_setg(errp, "Bad initial magic received: 0x%" PRIx64, magic); -goto fail; +return -EINVAL; } if (nbd_read(ioc, , sizeof(magic), errp) < 0) { error_prepend(errp, "Failed to read server magic: "); -goto fail; +return -EINVAL; } magic = be64_to_cpu(magic); trace_nbd_receive_negotiate_magic(magic); @@ -865,7 +863,7 @@ int nbd_receive_negotiate(QIOChannel *ioc, QCryptoTLSCreds *tlscreds, if (nbd_read(ioc, , sizeof(globalflags), errp) < 0) { error_prepend(errp, "Failed to read server flags: "); -goto fail; +return -EINVAL; } globalflags = be16_to_cpu(globalflags); trace_nbd_receive_negotiate_server_flags(globalflags); @@ -881,18 +879,18 @@ int nbd_receive_negotiate(QIOChannel *ioc, QCryptoTLSCreds *tlscreds, clientflags = cpu_to_be32(clientflags); if (nbd_write(ioc, , sizeof(clientflags), errp) < 0) { error_prepend(errp, "Failed to send clientflags field: "); -goto fail; +return -EINVAL; } if (tlscreds) { if (fixedNewStyle) { *outioc = nbd_receive_starttls(ioc, tlscreds, hostname, errp); if (!*outioc) { -goto fail; +return -EINVAL; } ioc = *outioc; } else { error_setg(errp, "Server does not support STARTTLS"); -goto fail; +return -EINVAL; } } if (fixedNewStyle) { @@ -903,7 +901,7 @@ int nbd_receive_negotiate(QIOChannel *ioc, QCryptoTLSCreds *tlscreds, NBD_OPT_STRUCTURED_REPLY, errp); if (result < 0) { -goto fail; +return -EINVAL; } info->structured_reply = result == 1; } @@ -911,7 +909,7 @@ int nbd_receive_negotiate(QIOChannel *ioc, QCryptoTLSCreds *tlscreds, if (info->structured_reply && base_allocation) { result = nbd_negotiate_simple_meta_context(ioc, info, errp); if (result < 0) { -goto fail; +return -EINVAL; } info->base_allocation = result == 1; } @@ -923,7 +921,7 @@ int nbd_receive_negotiate(QIOChannel *ioc, QCryptoTLSCreds *tlscreds, * export, then use NBD_OPT_EXPORT_NAME. */ result = nbd_opt_go(ioc, info, errp); if (result < 0) { -goto fail; +return -EINVAL; } if (result > 0) { return 0; @@ -935,25 +933,25 @@ int nbd_receive_negotiate(QIOChannel *ioc, QCryptoTLSCreds *tlscreds, * export name is not available. */ if (nbd_receive_query_exports(ioc, info->name, errp) < 0) { -goto fail; +return -EINVAL; } } /* write the export name request */ if (nbd_send_option_request(ioc, NBD_OPT_EXPORT_NAME, -1, info->name, errp) < 0) { -goto fail; +return -EINVAL; } /* Read the response */ if (nbd_read(ioc,
[Qemu-devel] [PATCH v3 05/19] nbd/server: Favor [u]int64_t over off_t
Although our compile-time environment is set up so that we always support long files with 64-bit off_t, we have no guarantee whether off_t is the same type as int64_t. This requires casts when printing values, and prevents us from directly using qemu_strtoi64(). Let's just flip to [u]int64_t (signed for length, because we have to detect failure of blk_getlength() and because off_t was signed; unsigned for offset because it lets us simplify some math without having to worry about signed overflow). Suggested-by: Vladimir Sementsov-Ogievskiy Signed-off-by: Eric Blake --- v3: new patch --- include/block/nbd.h | 4 ++-- nbd/server.c| 14 +++--- qemu-nbd.c | 26 ++ 3 files changed, 19 insertions(+), 25 deletions(-) diff --git a/include/block/nbd.h b/include/block/nbd.h index 1971b557896..0f252829376 100644 --- a/include/block/nbd.h +++ b/include/block/nbd.h @@ -294,8 +294,8 @@ int nbd_errno_to_system_errno(int err); typedef struct NBDExport NBDExport; typedef struct NBDClient NBDClient; -NBDExport *nbd_export_new(BlockDriverState *bs, off_t dev_offset, off_t size, - const char *name, const char *description, +NBDExport *nbd_export_new(BlockDriverState *bs, uint64_t dev_offset, + int64_t size, const char *name, const char *desc, const char *bitmap, uint16_t nbdflags, void (*close)(NBDExport *), bool writethrough, BlockBackend *on_eject_blk, Error **errp); diff --git a/nbd/server.c b/nbd/server.c index c9937ccdc2a..15357d40fd7 100644 --- a/nbd/server.c +++ b/nbd/server.c @@ -77,8 +77,8 @@ struct NBDExport { BlockBackend *blk; char *name; char *description; -off_t dev_offset; -off_t size; +uint64_t dev_offset; +int64_t size; uint16_t nbdflags; QTAILQ_HEAD(, NBDClient) clients; QTAILQ_ENTRY(NBDExport) next; @@ -1455,8 +1455,8 @@ static void nbd_eject_notifier(Notifier *n, void *data) nbd_export_close(exp); } -NBDExport *nbd_export_new(BlockDriverState *bs, off_t dev_offset, off_t size, - const char *name, const char *description, +NBDExport *nbd_export_new(BlockDriverState *bs, uint64_t dev_offset, + int64_t size, const char *name, const char *desc, const char *bitmap, uint16_t nbdflags, void (*close)(NBDExport *), bool writethrough, BlockBackend *on_eject_blk, Error **errp) @@ -1497,7 +1497,7 @@ NBDExport *nbd_export_new(BlockDriverState *bs, off_t dev_offset, off_t size, exp->blk = blk; exp->dev_offset = dev_offset; exp->name = g_strdup(name); -exp->description = g_strdup(description); +exp->description = g_strdup(desc); exp->nbdflags = nbdflags; assert(dev_offset <= size); exp->size = QEMU_ALIGN_DOWN(size, BDRV_SECTOR_SIZE); @@ -2131,8 +2131,8 @@ static int nbd_co_receive_request(NBDRequestData *req, NBDRequest *request, if (request->from > client->exp->size || request->from + request->len > client->exp->size) { error_setg(errp, "operation past EOF; From: %" PRIu64 ", Len: %" PRIu32 - ", Size: %" PRIu64, request->from, request->len, - (uint64_t)client->exp->size); + ", Size: %" PRId64, request->from, request->len, + client->exp->size); return (request->type == NBD_CMD_WRITE || request->type == NBD_CMD_WRITE_ZEROES) ? -ENOSPC : -EINVAL; } diff --git a/qemu-nbd.c b/qemu-nbd.c index ff4adb9b3eb..96c0829970c 100644 --- a/qemu-nbd.c +++ b/qemu-nbd.c @@ -176,7 +176,7 @@ static void read_partition(uint8_t *p, struct partition_record *r) } static int find_partition(BlockBackend *blk, int partition, - off_t *offset, off_t *size) + uint64_t *offset, int64_t *size) { struct partition_record mbr[4]; uint8_t data[MBR_SIZE]; @@ -500,14 +500,14 @@ int main(int argc, char **argv) { BlockBackend *blk; BlockDriverState *bs; -off_t dev_offset = 0; +uint64_t dev_offset = 0; uint16_t nbdflags = 0; bool disconnect = false; const char *bindto = NULL; const char *port = NULL; char *sockpath = NULL; char *device = NULL; -off_t fd_size; +int64_t fd_size; QemuOpts *sn_opts = NULL; const char *sn_id_or_name = NULL; const char *sopt = "hVb:o:p:rsnP:c:dvk:e:f:tl:x:T:D:B:"; @@ -665,10 +665,6 @@ int main(int argc, char **argv) error_report("Invalid offset `%s'", optarg); exit(EXIT_FAILURE); } -if (dev_offset < 0) { -error_report("Offset must be positive `%s'", optarg); -exit(EXIT_FAILURE); -} break; case 'l': if (strstart(optarg,
[Qemu-devel] [PATCH v3 06/19] qemu-nbd: Avoid strtol open-coding
Our copy-and-pasted open-coding of strtol handling forgot to handle overflow conditions. Use qemu_strto*() instead. In the case of --partition, since we insist on a user-supplied partition to be non-zero, we can use 0 rather than -1 for our initial value to distinguish when a partition is not being served, for slightly more optimal code. The error messages for out-of-bounds values are less specific, but should not be a terrible loss in quality. Signed-off-by: Eric Blake Message-Id: <20181215135324.152629-8-ebl...@redhat.com> --- v3: rebase to use int64_t rather than off_t [Vladimir] --- qemu-nbd.c | 28 +--- 1 file changed, 9 insertions(+), 19 deletions(-) diff --git a/qemu-nbd.c b/qemu-nbd.c index 96c0829970c..4670b659167 100644 --- a/qemu-nbd.c +++ b/qemu-nbd.c @@ -546,9 +546,8 @@ int main(int argc, char **argv) }; int ch; int opt_ind = 0; -char *end; int flags = BDRV_O_RDWR; -int partition = -1; +int partition = 0; int ret = 0; bool seen_cache = false; bool seen_discard = false; @@ -660,9 +659,8 @@ int main(int argc, char **argv) port = optarg; break; case 'o': -dev_offset = strtoll (optarg, , 0); -if (*end) { -error_report("Invalid offset `%s'", optarg); +if (qemu_strtou64(optarg, NULL, 0, _offset) < 0) { +error_report("Invalid offset '%s'", optarg); exit(EXIT_FAILURE); } break; @@ -684,13 +682,9 @@ int main(int argc, char **argv) flags &= ~BDRV_O_RDWR; break; case 'P': -partition = strtol(optarg, , 0); -if (*end) { -error_report("Invalid partition `%s'", optarg); -exit(EXIT_FAILURE); -} -if (partition < 1 || partition > 8) { -error_report("Invalid partition %d", partition); +if (qemu_strtoi(optarg, NULL, 0, ) < 0 || +partition < 1 || partition > 8) { +error_report("Invalid partition '%s'", optarg); exit(EXIT_FAILURE); } break; @@ -711,15 +705,11 @@ int main(int argc, char **argv) device = optarg; break; case 'e': -shared = strtol(optarg, , 0); -if (*end) { +if (qemu_strtoi(optarg, NULL, 0, ) < 0 || +shared < 1) { error_report("Invalid shared device number '%s'", optarg); exit(EXIT_FAILURE); } -if (shared < 1) { -error_report("Shared device number must be greater than 0"); -exit(EXIT_FAILURE); -} break; case 'f': fmt = optarg; @@ -1007,7 +997,7 @@ int main(int argc, char **argv) } fd_size -= dev_offset; -if (partition != -1) { +if (partition) { int64_t limit; if (dev_offset) { -- 2.20.1
[Qemu-devel] [PATCH v3 19/19] iotests: Enhance 223, 233 to cover 'qemu-nbd --list'
Any good new feature deserves some regression testing :) Coverage includes: - 223: what happens when there are 0 or more than 1 export, proof that we can see multiple contexts including qemu:dirty-bitmap - 233: proof that we can list over TLS, and that mix-and-match of plain/TLS listings will behave sanely Signed-off-by: Eric Blake Reviewed-by: Richard W.M. Jones Tested-by: Richard W.M. Jones Reviewed-by: Vladimir Sementsov-Ogievskiy --- v3: Rebase to earlier changes --- tests/qemu-iotests/223 | 2 ++ tests/qemu-iotests/223.out | 20 tests/qemu-iotests/233 | 19 +-- tests/qemu-iotests/233.out | 15 +++ 4 files changed, 50 insertions(+), 6 deletions(-) diff --git a/tests/qemu-iotests/223 b/tests/qemu-iotests/223 index 773892dbe60..f120a016460 100755 --- a/tests/qemu-iotests/223 +++ b/tests/qemu-iotests/223 @@ -127,6 +127,7 @@ _send_qemu_cmd $QEMU_HANDLE '{"execute":"nbd-server-start", _send_qemu_cmd $QEMU_HANDLE '{"execute":"nbd-server-start", "arguments":{"addr":{"type":"unix", "data":{"path":"'"$TEST_DIR/nbd"1'"' "error" # Attempt second server +$QEMU_NBD_PROG -L -k "$TEST_DIR/nbd" _send_qemu_cmd $QEMU_HANDLE '{"execute":"nbd-server-add", "arguments":{"device":"n", "bitmap":"b"}}' "return" _send_qemu_cmd $QEMU_HANDLE '{"execute":"nbd-server-add", @@ -142,6 +143,7 @@ _send_qemu_cmd $QEMU_HANDLE '{"execute":"nbd-server-add", _send_qemu_cmd $QEMU_HANDLE '{"execute":"nbd-server-add", "arguments":{"device":"n", "name":"n2", "writable":true, "bitmap":"b2"}}' "return" +$QEMU_NBD_PROG -L -k "$TEST_DIR/nbd" echo echo "=== Contrast normal status to large granularity dirty-bitmap ===" diff --git a/tests/qemu-iotests/223.out b/tests/qemu-iotests/223.out index 0de5240a75e..6476b77ba20 100644 --- a/tests/qemu-iotests/223.out +++ b/tests/qemu-iotests/223.out @@ -30,12 +30,32 @@ wrote 2097152/2097152 bytes at offset 2097152 {"error": {"class": "GenericError", "desc": "NBD server not running"}} {"return": {}} {"error": {"class": "GenericError", "desc": "NBD server already running"}} +exports available: 0 {"return": {}} {"error": {"class": "GenericError", "desc": "Cannot find device=nosuch nor node_name=nosuch"}} {"error": {"class": "GenericError", "desc": "NBD server already has export named 'n'"}} {"error": {"class": "GenericError", "desc": "Enabled bitmap 'b2' incompatible with readonly export"}} {"error": {"class": "GenericError", "desc": "Bitmap 'b3' is not found"}} {"return": {}} +exports available: 2 + export: 'n' + size: 4194304 + flags: 0x4ef ( readonly flush fua trim zeroes df cache ) + min block: 512 + opt block: 4096 + max block: 33554432 + available meta contexts: 2 + base:allocation + qemu:dirty-bitmap:b + export: 'n2' + size: 4194304 + flags: 0x4ed ( flush fua trim zeroes df cache ) + min block: 512 + opt block: 4096 + max block: 33554432 + available meta contexts: 2 + base:allocation + qemu:dirty-bitmap:b2 === Contrast normal status to large granularity dirty-bitmap === diff --git a/tests/qemu-iotests/233 b/tests/qemu-iotests/233 index 1814efe..a6ef7b20fb4 100755 --- a/tests/qemu-iotests/233 +++ b/tests/qemu-iotests/233 @@ -68,10 +68,12 @@ echo echo "== check TLS client to plain server fails ==" nbd_server_start_tcp_socket -f $IMGFMT "$TEST_IMG" -$QEMU_IMG info --image-opts \ ---object tls-creds-x509,dir=${tls_dir}/client1,endpoint=client,id=tls0 \ +obj=tls-creds-x509,dir=${tls_dir}/client1,endpoint=client,id=tls0 +$QEMU_IMG info --image-opts --object $obj \ driver=nbd,host=$nbd_tcp_addr,port=$nbd_tcp_port,tls-creds=tls0 \ 2>&1 | sed "s/$nbd_tcp_port/PORT/g" +$QEMU_NBD_PROG -L -b $nbd_tcp_addr -p $nbd_tcp_port --object $obj \ +--tls-creds=tls0 nbd_server_stop @@ -84,20 +86,25 @@ nbd_server_start_tcp_socket \ -f $IMGFMT "$TEST_IMG" $QEMU_IMG info nbd://localhost:$nbd_tcp_port 2>&1 | sed "s/$nbd_tcp_port/PORT/g" +$QEMU_NBD_PROG -L -b $nbd_tcp_addr -p $nbd_tcp_port echo echo "== check TLS works ==" -$QEMU_IMG info --image-opts \ ---object tls-creds-x509,dir=${tls_dir}/client1,endpoint=client,id=tls0 \ +obj=tls-creds-x509,dir=${tls_dir}/client1,endpoint=client,id=tls0 +$QEMU_IMG info --image-opts --object $obj \ driver=nbd,host=$nbd_tcp_addr,port=$nbd_tcp_port,tls-creds=tls0 \ 2>&1 | sed "s/$nbd_tcp_port/PORT/g" +$QEMU_NBD_PROG -L -b $nbd_tcp_addr -p $nbd_tcp_port --object $obj \ +--tls-creds=tls0 echo echo "== check TLS with different CA fails ==" -$QEMU_IMG info --image-opts \ ---object tls-creds-x509,dir=${tls_dir}/client2,endpoint=client,id=tls0 \ +obj=tls-creds-x509,dir=${tls_dir}/client2,endpoint=client,id=tls0 +$QEMU_IMG info --image-opts --object $obj \ driver=nbd,host=$nbd_tcp_addr,port=$nbd_tcp_port,tls-creds=tls0 \ 2>&1 | sed "s/$nbd_tcp_port/PORT/g" +$QEMU_NBD_PROG -L -b $nbd_tcp_addr -p $nbd_tcp_port --object $obj \ +--tls-creds=tls0 echo echo "== perform I/O over TLS ==" diff --git
[Qemu-devel] [PATCH v3 09/19] nbd/client: Change signature of nbd_negotiate_simple_meta_context()
Pass 'info' instead of three separate parameters related to info, when requesting the server to set the meta context. Update the NBDExportInfo struct to rename the received id field to match the fact that we are currently overloading the field to match whatever context the user supplied through the x-dirty-bitmap hack, as well as adding a TODO comment to remind future patches about a desire to request two contexts at once. Signed-off-by: Eric Blake Message-Id: <20181215135324.152629-12-ebl...@redhat.com> Reviewed-by: Richard W.M. Jones Reviewed-by: Vladimir Sementsov-Ogievskiy --- include/block/nbd.h | 2 +- block/nbd-client.c | 4 ++-- nbd/client.c| 53 + 3 files changed, 28 insertions(+), 31 deletions(-) diff --git a/include/block/nbd.h b/include/block/nbd.h index 6d76d3dc221..bdc7ec7195a 100644 --- a/include/block/nbd.h +++ b/include/block/nbd.h @@ -276,7 +276,7 @@ struct NBDExportInfo { uint32_t opt_block; uint32_t max_block; -uint32_t meta_base_allocation_id; +uint32_t context_id; }; typedef struct NBDExportInfo NBDExportInfo; diff --git a/block/nbd-client.c b/block/nbd-client.c index 3309376bc16..813539676d2 100644 --- a/block/nbd-client.c +++ b/block/nbd-client.c @@ -249,11 +249,11 @@ static int nbd_parse_blockstatus_payload(NBDClientSession *client, } context_id = payload_advance32(); -if (client->info.meta_base_allocation_id != context_id) { +if (client->info.context_id != context_id) { error_setg(errp, "Protocol error: unexpected context id %d for " "NBD_REPLY_TYPE_BLOCK_STATUS, when negotiated context " "id is %d", context_id, - client->info.meta_base_allocation_id); + client->info.context_id); return -EINVAL; } diff --git a/nbd/client.c b/nbd/client.c index 8227e69478a..77993890f04 100644 --- a/nbd/client.c +++ b/nbd/client.c @@ -630,26 +630,30 @@ static QIOChannel *nbd_receive_starttls(QIOChannel *ioc, } /* nbd_negotiate_simple_meta_context: - * Set one meta context. Simple means that reply must contain zero (not - * negotiated) or one (negotiated) contexts. More contexts would be considered - * as a protocol error. It's also implied that meta-data query equals queried - * context name, so, if server replies with something different than @context, - * it is considered an error too. - * return 1 for successful negotiation, context_id is set + * Request the server to set the meta context for export @info->name + * using @info->x_dirty_bitmap with a fallback to "base:allocation", + * setting @info->context_id to the resulting id. Fail if the server + * responds with more than one context or with a context different + * than the query. + * return 1 for successful negotiation, *0 if operation is unsupported, *-1 with errp set for any other error */ static int nbd_negotiate_simple_meta_context(QIOChannel *ioc, - const char *export, - const char *context, - uint32_t *context_id, + NBDExportInfo *info, Error **errp) { +/* + * TODO: Removing the x_dirty_bitmap hack will mean refactoring + * this function to request and store ids for multiple contexts + * (both base:allocation and a dirty bitmap), at which point this + * function should lose the term _simple. + */ int ret; NBDOptionReply reply; -uint32_t received_id = 0; +const char *context = info->x_dirty_bitmap ?: "base:allocation"; bool received = false; -uint32_t export_len = strlen(export); +uint32_t export_len = strlen(info->name); uint32_t context_len = strlen(context); uint32_t data_len = sizeof(export_len) + export_len + sizeof(uint32_t) + /* number of queries */ @@ -657,9 +661,9 @@ static int nbd_negotiate_simple_meta_context(QIOChannel *ioc, char *data = g_malloc(data_len); char *p = data; -trace_nbd_opt_meta_request(context, export); +trace_nbd_opt_meta_request(context, info->name); stl_be_p(p, export_len); -memcpy(p += sizeof(export_len), export, export_len); +memcpy(p += sizeof(export_len), info->name, export_len); stl_be_p(p += export_len, 1); stl_be_p(p += sizeof(uint32_t), context_len); memcpy(p += sizeof(context_len), context, context_len); @@ -685,7 +689,7 @@ static int nbd_negotiate_simple_meta_context(QIOChannel *ioc, if (reply.type == NBD_REP_META_CONTEXT) { char *name; -if (reply.length != sizeof(received_id) + context_len) { +if (reply.length != sizeof(info->context_id) + context_len) { error_setg(errp, "Failed to negotiate meta context '%s', server "
[Qemu-devel] [PATCH v3 16/19] nbd/client: Add meta contexts to nbd_receive_export_list()
We want to be able to detect whether a given qemu NBD server is exposing the right export(s) and dirty bitmaps, at least for regression testing. We could use 'nbd-client -l' from the upstream NBD project to list exports, but it's annoying to rely on out-of-tree binaries; furthermore, nbd-client doesn't necessarily know about all of the qemu NBD extensions. Thus, we plan on adding a new mode to qemu-nbd that merely sniffs all possible information from the server during handshake phase, then disconnects and dumps the information. This patch continues the work of the previous patch, by adding the ability to track the list of available meta contexts into NBDExportInfo. It benefits from the recent refactoring patches with a new nbd_list_meta_contexts() that reuses much of the same framework as setting a meta context. Note: a malicious server could exhaust memory of a client by feeding an unending loop of contexts; perhaps we could place a limit on how many we are willing to receive. But this is no different from our earlier analysis on a server sending an unending list of exports, and the death of a client due to memory exhaustion when the client was going to exit soon anyways is not really a denial of service attack. Signed-off-by: Eric Blake Message-Id: <20181215135324.152629-20-ebl...@redhat.com> Reviewed-by: Richard W.M. Jones --- v3: mention security (non-)issue in commit message [Rich] --- include/block/nbd.h | 2 ++ nbd/client.c| 41 +++-- 2 files changed, 41 insertions(+), 2 deletions(-) diff --git a/include/block/nbd.h b/include/block/nbd.h index e9a442ce7e9..80ee3cc997e 100644 --- a/include/block/nbd.h +++ b/include/block/nbd.h @@ -284,6 +284,8 @@ struct NBDExportInfo { /* Set by server results during nbd_receive_export_list() */ char *description; +int n_contexts; +char **contexts; }; typedef struct NBDExportInfo NBDExportInfo; diff --git a/nbd/client.c b/nbd/client.c index a5a705a67f7..2001e6e8160 100644 --- a/nbd/client.c +++ b/nbd/client.c @@ -817,6 +817,36 @@ static int nbd_negotiate_simple_meta_context(QIOChannel *ioc, return received; } +/* + * nbd_list_meta_contexts: + * Request the server to list all meta contexts for export @info->name. + * return 0 if list is complete (even if empty), + *-1 with errp set for any other error + */ +static int nbd_list_meta_contexts(QIOChannel *ioc, + NBDExportInfo *info, + Error **errp) +{ +int ret; + +if (nbd_send_one_meta_context(ioc, NBD_OPT_LIST_META_CONTEXT, + info->name, NULL, errp) < 0) { +return -1; +} + +while (1) { +char *context; + +ret = nbd_receive_one_meta_context(ioc, NBD_OPT_LIST_META_CONTEXT, + , NULL, errp); +if (ret <= 0) { +return ret; +} +info->contexts = g_renew(char *, info->contexts, ++info->n_contexts); +info->contexts[info->n_contexts - 1] = context; +} +} + /* * nbd_start_negotiate: * Start the handshake to the server. After a positive return, the server @@ -1063,7 +1093,7 @@ int nbd_receive_negotiate(QIOChannel *ioc, QCryptoTLSCreds *tlscreds, /* Clean up result of nbd_receive_export_list */ void nbd_free_export_list(NBDExportInfo *info, int count) { -int i; +int i, j; if (!info) { return; @@ -1072,6 +1102,10 @@ void nbd_free_export_list(NBDExportInfo *info, int count) for (i = 0; i < count; i++) { g_free(info[i].name); g_free(info[i].description); +for (j = 0; j < info[i].n_contexts; j++) { +g_free(info[i].contexts[j]); +} +g_free(info[i].contexts); } g_free(info); } @@ -1139,7 +1173,10 @@ int nbd_receive_export_list(QIOChannel *ioc, QCryptoTLSCreds *tlscreds, break; } -/* TODO: Grab meta contexts */ +if (result == 3 && +nbd_list_meta_contexts(ioc, [i], errp) < 0) { +goto out; +} } /* Send NBD_OPT_ABORT as a courtesy before hanging up */ -- 2.20.1
[Qemu-devel] [PATCH v3 07/19] nbd/client: Refactor nbd_receive_list()
Right now, nbd_receive_list() is only called by nbd_receive_query_exports(), which in turn is only called if the server lacks NBD_OPT_GO but has working option negotiation, and is merely used as a quality-of-implementation trick since servers can't give decent errors for NBD_OPT_EXPORT_NAME. However, servers that lack NBD_OPT_GO are becoming increasingly rare (nbdkit was a latecomer, in Aug 2018, but qemu has been such a server since commit f37708f6 in July 2017 and released in 2.10), so it no longer makes sense to micro-optimize that function for performance. Furthermore, when debugging a server's implementation, tracing the full reply (both names and descriptions) is useful, not to mention that upcoming patches adding 'qemu-nbd --list' will want to collect that data. And when you consider that a server can send an export name up to the NBD protocol length limit of 4k; but our current NBD_MAX_NAME_SIZE is only 256, we can't trace all valid server names without more storage, but 4k is large enough that the heap is better than the stack for long names. Thus, I'm changing the division of labor, with nbd_receive_list() now always malloc'ing a result on success (the malloc is bounded by the fact that we reject servers with a reply length larger than 32M), and moving the comparison to 'wantname' to the caller. There is a minor change in behavior where a server with 0 exports (an immediate NBD_REP_ACK reply) is now no longer distinguished from a server without LIST support (NBD_REP_ERR_UNSUP); this information could be preserved with a complication to the calling contract to provide a bit more information, but I didn't see the point. After all, the worst that can happen if our guess at a match is wrong is that the caller will get a cryptic disconnect when NBD_OPT_EXPORT_NAME fails (which is no different from what would happen if we had not tried LIST), while treating an empty list as immediate failure would prevent connecting to really old servers that really did lack LIST. Besides, NBD servers with 0 exports are rare (qemu can do it when using QMP nbd-server-start without nbd-server-add - but qemu understands NBD_OPT_GO and thus won't tickle this change in behavior). Fix the spelling of foundExport to match coding standards while in the area. Signed-off-by: Eric Blake Message-Id: <20181215135324.152629-10-ebl...@redhat.com> Reviewed-by: Richard W.M. Jones Reviewed-by: Vladimir Sementsov-Ogievskiy --- v3: comment tweak, s/listEmpty/list_empty/, s/foundExport/found_export/ [Vladimir] --- nbd/client.c | 91 ++-- nbd/trace-events | 1 + 2 files changed, 59 insertions(+), 33 deletions(-) diff --git a/nbd/client.c b/nbd/client.c index f625c207c54..fd4ba8dec37 100644 --- a/nbd/client.c +++ b/nbd/client.c @@ -234,18 +234,24 @@ static int nbd_handle_reply_err(QIOChannel *ioc, NBDOptionReply *reply, return result; } -/* Process another portion of the NBD_OPT_LIST reply. Set *@match if - * the current reply matches @want or if the server does not support - * NBD_OPT_LIST, otherwise leave @match alone. Return 0 if iteration - * is complete, positive if more replies are expected, or negative - * with @errp set if an unrecoverable error occurred. */ -static int nbd_receive_list(QIOChannel *ioc, const char *want, bool *match, +/* nbd_receive_list: + * Process another portion of the NBD_OPT_LIST reply, populating any + * name received into *@name. If @description is non-NULL, and the + * server provided a description, that is also populated. The caller + * must eventually call g_free() on success. + * Returns 1 if name and description were set and iteration must continue, + * 0 if iteration is complete (including if OPT_LIST unsupported), + * -1 with @errp set if an unrecoverable error occurred. + */ +static int nbd_receive_list(QIOChannel *ioc, char **name, char **description, Error **errp) { +int ret = -1; NBDOptionReply reply; uint32_t len; uint32_t namelen; -char name[NBD_MAX_NAME_SIZE + 1]; +char *local_name = NULL; +char *local_desc = NULL; int error; if (nbd_receive_option_reply(ioc, NBD_OPT_LIST, , errp) < 0) { @@ -253,9 +259,6 @@ static int nbd_receive_list(QIOChannel *ioc, const char *want, bool *match, } error = nbd_handle_reply_err(ioc, , errp); if (error <= 0) { -/* The server did not support NBD_OPT_LIST, so set *match on - * the assumption that any name will be accepted. */ -*match = true; return error; } len = reply.length; @@ -292,33 +295,38 @@ static int nbd_receive_list(QIOChannel *ioc, const char *want, bool *match, nbd_send_opt_abort(ioc); return -1; } -if (namelen != strlen(want)) { -if (nbd_drop(ioc, len, errp) < 0) { -error_prepend(errp, - "failed to skip export name with wrong length: "); -
[Qemu-devel] [PATCH v3 17/19] qemu-nbd: Add --list option
We want to be able to detect whether a given qemu NBD server is exposing the right export(s) and dirty bitmaps, at least for regression testing. We could use 'nbd-client -l' from the upstream NBD project to list exports, but it's annoying to rely on out-of-tree binaries; furthermore, nbd-client doesn't necessarily know about all of the qemu NBD extensions. Thus, it is time to add a new mode to qemu-nbd that merely sniffs all possible information from the server during handshake phase, then disconnects and dumps the information. This patch actually implements --list/-L, while reusing other options such as --tls-creds for now designating how to connect as the client (rather than their non-list usage of how to operate as the server). I debated about adding this functionality to something akin to 'qemu-img info' - but that tool does not readily lend itself to connecting to an arbitrary NBD server without also tying to a specific export (I may, however, still add ImageInfoSpecificNBD for reporting the bitmaps available when connecting to a single export). And, while it may feel a bit odd that normally qemu-nbd is a server but 'qemu-nbd -L' is a client, we are not really making the qemu-nbd binary that much larger, because 'qemu-nbd -c' has to operate as both server and client simultaneously across two threads when feeding the kernel module for /dev/nbdN access. Sample output: $ qemu-nbd -L exports available: 1 export: '' size: 65536 flags: 0x4ed ( flush fua trim zeroes df cache ) min block: 512 opt block: 4096 max block: 33554432 available meta contexts: 1 base:allocation Note that the output only lists sizes if the server sent NBD_FLAG_HAS_FLAGS, because a newstyle server does not give the size otherwise. It has the side effect that for really old servers that did not send any flags, the size is not output even though it was available. However, I'm not too concerned about that - oldstyle servers are (rightfully) getting less common to encounter (qemu 3.0 was the last version where we even serve it), and most existing servers that still even offer oldstyle negotiation (such as nbdkit) still send flags (since that was added to the NBD protocol in 2007 to permit read-only connections). Not done here, but maybe worth future experiments: capture the meat of NBDExportInfo into a QAPI struct, and use the generated QAPI pretty-printers instead of hand-rolling our output loop. It would also permit us to add a JSON output mode for machine parsing. Signed-off-by: Eric Blake Message-Id: <20181215135324.152629-21-ebl...@redhat.com> Reviewed-by: Richard W.M. Jones --- v3: comment tweak [Rich], rebase to earlier changes --- qemu-nbd.texi | 27 +++-- qemu-nbd.c| 155 +- 2 files changed, 165 insertions(+), 17 deletions(-) diff --git a/qemu-nbd.texi b/qemu-nbd.texi index 3f22559beb4..65caeb7874a 100644 --- a/qemu-nbd.texi +++ b/qemu-nbd.texi @@ -2,6 +2,8 @@ @c man begin SYNOPSIS @command{qemu-nbd} [OPTION]... @var{filename} +@command{qemu-nbd} @option{-L} [OPTION]... + @command{qemu-nbd} @option{-d} @var{dev} @c man end @end example @@ -14,6 +16,8 @@ Other uses: @itemize @item Bind a /dev/nbdX block device to a QEMU server (on Linux). +@item +As a client to query exports of a remote NBD server. @end itemize @c man end @@ -31,13 +35,15 @@ See the @code{qemu(1)} manual page for full details of the properties supported. The common object types that it makes sense to define are the @code{secret} object, which is used to supply passwords and/or encryption keys, and the @code{tls-creds} object, which is used to supply TLS -credentials for the qemu-nbd server. +credentials for the qemu-nbd server or client. @item -p, --port=@var{port} -The TCP port to listen on (default @samp{10809}). +The TCP port to listen on as a server, or connect to as a client +(default @samp{10809}). @item -o, --offset=@var{offset} The offset into the image. @item -b, --bind=@var{iface} -The interface to bind to (default @samp{0.0.0.0}). +The interface to bind to as a server, or connect to as a client +(default @samp{0.0.0.0}). @item -k, --socket=@var{path} Use a unix socket with path @var{path}. @item --image-opts @@ -97,10 +103,14 @@ Set the NBD volume export name (default of a zero-length string). @item -D, --description=@var{description} Set the NBD volume export description, as a human-readable string. +@item -L, --list +Connect as a client and list all details about the exports exposed by +a remote NBD server. @item --tls-creds=ID Enable mandatory TLS encryption for the server by setting the ID of the TLS credentials object previously created with the --object -option. +option; or provide the credentials needed for connecting as a client +in list mode. @item --fork Fork off the server process and exit the parent once the server is running. @item -v, --verbose @@ -159,6 +169,15 @@ qemu-nbd -c /dev/nbd0 -f qcow2 file.qcow2 qemu-nbd
[Qemu-devel] [PATCH v3 18/19] nbd/client: Work around 3.0 bug for listing meta contexts
Commit 3d068aff forgot to advertise available qemu: contexts when the client requests a list with 0 queries. Furthermore, 3.0 shipped with a qemu-img hack of x-dirty-bitmap (commit 216ee365) that _silently_ acts as though the entire image is clean if a requested bitmap is not present. Both bugs have been recently fixed, so that a modern qemu server gives full context output right away, and the client refuses a connection if a requested x-dirty-bitmap was not found. Still, it is likely that there will be users that have to work with a mix of old and new qemu versions, depending on which features get backported where, at which point being able to rely on 'qemu-img --list' output to know for sure whether a given NBD export has the desired dirty bitmap is much nicer than blindly connecting and risking that the entire image may appear clean. We can make our --list code smart enough to work around buggy servers by tracking whether we've seen any qemu: replies in the original 0-query list; if not, repeat with a single query on "qemu:" (which may still have no replies, but then we know for sure we didn't trip up on the server bug). Signed-off-by: Eric Blake Message-Id: <20181215135324.152629-22-ebl...@redhat.com> --- nbd/client.c | 19 +++ 1 file changed, 19 insertions(+) diff --git a/nbd/client.c b/nbd/client.c index 2001e6e8160..64f3e45edd4 100644 --- a/nbd/client.c +++ b/nbd/client.c @@ -21,6 +21,7 @@ #include "qapi/error.h" #include "trace.h" #include "nbd-internal.h" +#include "qemu/cutils.h" /* Definitions for opaque data types */ @@ -828,6 +829,8 @@ static int nbd_list_meta_contexts(QIOChannel *ioc, Error **errp) { int ret; +int seen_any = false; +int seen_qemu = false; if (nbd_send_one_meta_context(ioc, NBD_OPT_LIST_META_CONTEXT, info->name, NULL, errp) < 0) { @@ -839,9 +842,25 @@ static int nbd_list_meta_contexts(QIOChannel *ioc, ret = nbd_receive_one_meta_context(ioc, NBD_OPT_LIST_META_CONTEXT, , NULL, errp); +if (ret == 0 && seen_any && !seen_qemu) { +/* + * Work around qemu 3.0 bug: the server forgot to send + * "qemu:" replies to 0 queries. If we saw at least one + * reply (probably base:allocation), but none of them were + * qemu:, then run a more specific query to make sure. + */ +seen_qemu = true; +if (nbd_send_one_meta_context(ioc, NBD_OPT_LIST_META_CONTEXT, + info->name, "qemu:", errp) < 0) { +return -1; +} +continue; +} if (ret <= 0) { return ret; } +seen_any = true; +seen_qemu |= strstart(context, "qemu:", NULL); info->contexts = g_renew(char *, info->contexts, ++info->n_contexts); info->contexts[info->n_contexts - 1] = context; } -- 2.20.1
Re: [Qemu-devel] block: Update flags in bdrv_set_read_only()
commit eeae6a596b0efc092f5101c67683053e245e6250 Author: Kevin Wolf Date: Tue Oct 9 16:57:12 2018 +0200 block: Update flags in bdrv_set_read_only() To fully change the read-only state of a node, we must not only change bs->read_only, but also update bs->open_flags. sort of broke vfat support: $ qemu-system-x86_64 -hda fat:foo/ WARNING: Image format was not specified for 'json:{"fat-type": 0, "dir": "foo/", "driver": "vvfat", "floppy": false, "rw": false}' and probing guessed raw. Automatically detecting the format is dangerous for raw images, write operations on block 0 will be restricted. Specify the 'raw' format explicitly to remove the restrictions. qemu-system-x86_64: Initialization of device ide-hd failed: Block node is read-only $ _ The warning is annoying but harmless, but the read-only error is fatal. "Sort-of" is because there's a somewhat strange workaround: -hda fat:rw:foo/ but it is a bit more dangerous as well. It looks like vfat should be handled differently somewhere, to eliminate both the warning and the error? Thanks, /mjt
[Qemu-devel] 3.1: second invocation of migrate crashes qemu
$ qemu-system-x86_64 -monitor stdio -hda foo.img QEMU 3.1.0 monitor - type 'help' for more information (qemu) stop (qemu) migrate "exec:cat >/dev/null" (qemu) migrate "exec:cat >/dev/null" qemu-system-x86_64: /build/qemu/qemu-3.1/block.c:4647: bdrv_inactivate_recurse: Assertion `!(bs->open_flags & BDRV_O_INACTIVE)' failed. Aborted (it is irrelevant what's in foo.img, it only needs to be initialized). If it is worth to bisect I'll do that tomorrow. Thanks, /mjt
Re: [Qemu-devel] [PATCH] hw/pvrdma: Post CQE when receive invalid gid index
On 1/9/19 10:15 PM, Yuval Shaia wrote: This error should propagate back to guest. Signed-off-by: Yuval Shaia --- hw/rdma/rdma_backend.h | 1 + hw/rdma/vmw/pvrdma_qp_ops.c | 6 -- 2 files changed, 5 insertions(+), 2 deletions(-) diff --git a/hw/rdma/rdma_backend.h b/hw/rdma/rdma_backend.h index a9ba40ae48..5114c90e67 100644 --- a/hw/rdma/rdma_backend.h +++ b/hw/rdma/rdma_backend.h @@ -32,6 +32,7 @@ #define VENDOR_ERR_INVLKEY 0x207 #define VENDOR_ERR_MR_SMALL 0x208 #define VENDOR_ERR_INV_MAD_BUFF 0x209 +#define VENDOR_ERR_INV_GID_IDX 0x210 /* Add definition for QP0 and QP1 as there is no userspace enums for them */ enum ibv_special_qp_type { diff --git a/hw/rdma/vmw/pvrdma_qp_ops.c b/hw/rdma/vmw/pvrdma_qp_ops.c index 465bee8641..0565eba981 100644 --- a/hw/rdma/vmw/pvrdma_qp_ops.c +++ b/hw/rdma/vmw/pvrdma_qp_ops.c @@ -178,7 +178,8 @@ int pvrdma_qp_send(PVRDMADev *dev, uint32_t qp_handle) sgid = rdma_rm_get_gid(>rdma_dev_res, wqe->hdr.wr.ud.av.gid_index); if (!sgid) { pr_dbg("Fail to get gid for idx %d\n", wqe->hdr.wr.ud.av.gid_index); -return -EIO; +complete_with_error(VENDOR_ERR_INV_GID_IDX, comp_ctx); Informing the guest is OK, but are you sure it makes sense to continue sending the other work requests? Is maybe safer to return with error as before? +continue; } pr_dbg("sgid_id=%d, sgid=0x%llx\n", wqe->hdr.wr.ud.av.gid_index, sgid->global.interface_id); @@ -189,7 +190,8 @@ int pvrdma_qp_send(PVRDMADev *dev, uint32_t qp_handle) if (sgid_idx <= 0) { pr_dbg("Fail to get bk sgid_idx for sgid_idx %d\n", wqe->hdr.wr.ud.av.gid_index); -return -EIO; +complete_with_error(VENDOR_ERR_INV_GID_IDX, comp_ctx); +continue; Same question here. Thanks, Marcel } if (wqe->hdr.num_sge > dev->dev_attr.max_sge) {
Re: [Qemu-devel] [PATCH] hw/rdma: Delete unused struct member
On 1/9/19 10:19 PM, Yuval Shaia wrote: This member is used only in init_device_caps function, make it local. Signed-off-by: Yuval Shaia --- hw/rdma/rdma_backend.c | 26 ++ hw/rdma/rdma_backend_defs.h | 1 - 2 files changed, 14 insertions(+), 13 deletions(-) diff --git a/hw/rdma/rdma_backend.c b/hw/rdma/rdma_backend.c index 16dca69ee9..b49edaacaf 100644 --- a/hw/rdma/rdma_backend.c +++ b/hw/rdma/rdma_backend.c @@ -917,23 +917,25 @@ void rdma_backend_destroy_qp(RdmaBackendQP *qp) static int init_device_caps(RdmaBackendDev *backend_dev, struct ibv_device_attr *dev_attr) { -if (ibv_query_device(backend_dev->context, _dev->dev_attr)) { +struct ibv_device_attr bk_dev_attr; + +if (ibv_query_device(backend_dev->context, _dev_attr)) { return -EIO; } dev_attr->max_sge = MAX_SGE; -CHK_ATTR(dev_attr, backend_dev->dev_attr, max_mr_size, "%" PRId64); -CHK_ATTR(dev_attr, backend_dev->dev_attr, max_qp, "%d"); -CHK_ATTR(dev_attr, backend_dev->dev_attr, max_sge, "%d"); -CHK_ATTR(dev_attr, backend_dev->dev_attr, max_qp_wr, "%d"); -CHK_ATTR(dev_attr, backend_dev->dev_attr, max_cq, "%d"); -CHK_ATTR(dev_attr, backend_dev->dev_attr, max_cqe, "%d"); -CHK_ATTR(dev_attr, backend_dev->dev_attr, max_mr, "%d"); -CHK_ATTR(dev_attr, backend_dev->dev_attr, max_pd, "%d"); -CHK_ATTR(dev_attr, backend_dev->dev_attr, max_qp_rd_atom, "%d"); -CHK_ATTR(dev_attr, backend_dev->dev_attr, max_qp_init_rd_atom, "%d"); -CHK_ATTR(dev_attr, backend_dev->dev_attr, max_ah, "%d"); +CHK_ATTR(dev_attr, bk_dev_attr, max_mr_size, "%" PRId64); +CHK_ATTR(dev_attr, bk_dev_attr, max_qp, "%d"); +CHK_ATTR(dev_attr, bk_dev_attr, max_sge, "%d"); +CHK_ATTR(dev_attr, bk_dev_attr, max_qp_wr, "%d"); +CHK_ATTR(dev_attr, bk_dev_attr, max_cq, "%d"); +CHK_ATTR(dev_attr, bk_dev_attr, max_cqe, "%d"); +CHK_ATTR(dev_attr, bk_dev_attr, max_mr, "%d"); +CHK_ATTR(dev_attr, bk_dev_attr, max_pd, "%d"); +CHK_ATTR(dev_attr, bk_dev_attr, max_qp_rd_atom, "%d"); +CHK_ATTR(dev_attr, bk_dev_attr, max_qp_init_rd_atom, "%d"); +CHK_ATTR(dev_attr, bk_dev_attr, max_ah, "%d"); return 0; } diff --git a/hw/rdma/rdma_backend_defs.h b/hw/rdma/rdma_backend_defs.h index 1e5c3dd3bf..15ae8b970e 100644 --- a/hw/rdma/rdma_backend_defs.h +++ b/hw/rdma/rdma_backend_defs.h @@ -41,7 +41,6 @@ typedef struct RdmaCmMux { } RdmaCmMux; typedef struct RdmaBackendDev { -struct ibv_device_attr dev_attr; RdmaBackendThread comp_thread; PCIDevice *dev; RdmaDeviceResources *rdma_dev_res; Reviewed-by: Marcel Apfelbaum Thanks, Marcel
Re: [Qemu-devel] [PATCH] hw/pvrdma: Remove max-sge command-line param
On 1/9/19 9:41 PM, Yuval Shaia wrote: This parameter has no effect, fix it. The function init_dev_caps sets the front-end's max-sge to MAX_SGE. Then it checks backend's max-sge and adjust it accordingly (we can't send more than what the device supports). On send and recv we need to make sure the num_sge in the WQE does not exceeds the backend device capability. This check is done in pvrdma level so check on rdma level is deleted. I think it makes sense to match max-sge with the underlying device capability. Reviewed-by: Marcel Apfelbaum Thanks, Marcel Signed-off-by: Yuval Shaia --- docs/pvrdma.txt | 1 - hw/rdma/rdma_backend.c | 23 ++- hw/rdma/rdma_backend.h | 11 +++ hw/rdma/vmw/pvrdma_main.c | 10 +- hw/rdma/vmw/pvrdma_qp_ops.c | 24 5 files changed, 42 insertions(+), 27 deletions(-) diff --git a/docs/pvrdma.txt b/docs/pvrdma.txt index 5175251b47..5973c0c68b 100644 --- a/docs/pvrdma.txt +++ b/docs/pvrdma.txt @@ -153,7 +153,6 @@ Ethernet function can be used for other Ethernet purposes such as IP. specify the port to use. If not set 1 will be used. - dev-caps-max-mr-size: The maximum size of MR. - dev-caps-max-qp: Maximum number of QPs. -- dev-caps-max-sge: Maximum number of SGE elements in WR. - dev-caps-max-cq: Maximum number of CQs. - dev-caps-max-mr: Maximum number of MRs. - dev-caps-max-pd: Maximum number of PDs. diff --git a/hw/rdma/rdma_backend.c b/hw/rdma/rdma_backend.c index c28bfbd44d..16dca69ee9 100644 --- a/hw/rdma/rdma_backend.c +++ b/hw/rdma/rdma_backend.c @@ -32,17 +32,6 @@ #include "rdma_rm.h" #include "rdma_backend.h" -/* Vendor Errors */ -#define VENDOR_ERR_FAIL_BACKEND 0x201 -#define VENDOR_ERR_TOO_MANY_SGES0x202 -#define VENDOR_ERR_NOMEM0x203 -#define VENDOR_ERR_QP0 0x204 -#define VENDOR_ERR_INV_NUM_SGE 0x205 -#define VENDOR_ERR_MAD_SEND 0x206 -#define VENDOR_ERR_INVLKEY 0x207 -#define VENDOR_ERR_MR_SMALL 0x208 -#define VENDOR_ERR_INV_MAD_BUFF 0x209 - #define THR_NAME_LEN 16 #define THR_POLL_TO 5000 @@ -475,11 +464,6 @@ void rdma_backend_post_send(RdmaBackendDev *backend_dev, } pr_dbg("num_sge=%d\n", num_sge); -if (!num_sge || num_sge > MAX_SGE) { -pr_dbg("invalid num_sge=%d\n", num_sge); -complete_work(IBV_WC_GENERAL_ERR, VENDOR_ERR_INV_NUM_SGE, ctx); -return; -} bctx = g_malloc0(sizeof(*bctx)); bctx->up_ctx = ctx; @@ -602,11 +586,6 @@ void rdma_backend_post_recv(RdmaBackendDev *backend_dev, } pr_dbg("num_sge=%d\n", num_sge); -if (!num_sge || num_sge > MAX_SGE) { -pr_dbg("invalid num_sge=%d\n", num_sge); -complete_work(IBV_WC_GENERAL_ERR, VENDOR_ERR_INV_NUM_SGE, ctx); -return; -} bctx = g_malloc0(sizeof(*bctx)); bctx->up_ctx = ctx; @@ -942,6 +921,8 @@ static int init_device_caps(RdmaBackendDev *backend_dev, return -EIO; } +dev_attr->max_sge = MAX_SGE; + CHK_ATTR(dev_attr, backend_dev->dev_attr, max_mr_size, "%" PRId64); CHK_ATTR(dev_attr, backend_dev->dev_attr, max_qp, "%d"); CHK_ATTR(dev_attr, backend_dev->dev_attr, max_sge, "%d"); diff --git a/hw/rdma/rdma_backend.h b/hw/rdma/rdma_backend.h index 8cae40f827..a9ba40ae48 100644 --- a/hw/rdma/rdma_backend.h +++ b/hw/rdma/rdma_backend.h @@ -22,6 +22,17 @@ #include "rdma_rm_defs.h" #include "rdma_backend_defs.h" +/* Vendor Errors */ +#define VENDOR_ERR_FAIL_BACKEND 0x201 +#define VENDOR_ERR_TOO_MANY_SGES0x202 +#define VENDOR_ERR_NOMEM0x203 +#define VENDOR_ERR_QP0 0x204 +#define VENDOR_ERR_INV_NUM_SGE 0x205 +#define VENDOR_ERR_MAD_SEND 0x206 +#define VENDOR_ERR_INVLKEY 0x207 +#define VENDOR_ERR_MR_SMALL 0x208 +#define VENDOR_ERR_INV_MAD_BUFF 0x209 + /* Add definition for QP0 and QP1 as there is no userspace enums for them */ enum ibv_special_qp_type { IBV_QPT_SMI = 0, diff --git a/hw/rdma/vmw/pvrdma_main.c b/hw/rdma/vmw/pvrdma_main.c index 838ad8a949..d2bdb5ba8c 100644 --- a/hw/rdma/vmw/pvrdma_main.c +++ b/hw/rdma/vmw/pvrdma_main.c @@ -43,7 +43,6 @@ static Property pvrdma_dev_properties[] = { DEFINE_PROP_UINT64("dev-caps-max-mr-size", PVRDMADev, dev_attr.max_mr_size, MAX_MR_SIZE), DEFINE_PROP_INT32("dev-caps-max-qp", PVRDMADev, dev_attr.max_qp, MAX_QP), -DEFINE_PROP_INT32("dev-caps-max-sge", PVRDMADev, dev_attr.max_sge, MAX_SGE), DEFINE_PROP_INT32("dev-caps-max-cq", PVRDMADev, dev_attr.max_cq, MAX_CQ), DEFINE_PROP_INT32("dev-caps-max-mr", PVRDMADev, dev_attr.max_mr, MAX_MR), DEFINE_PROP_INT32("dev-caps-max-pd", PVRDMADev, dev_attr.max_pd, MAX_PD), @@ -549,8 +548,9 @@ static void init_dev_caps(PVRDMADev *dev) sizeof(struct pvrdma_rq_wqe_hdr)); dev->dev_attr.max_qp_wr
Re: [Qemu-devel] [PATCH] hw/pvrdma: Make function pvrdma_qp_send/recv return void.
On 1/9/19 10:21 PM, Yuval Shaia wrote: The functions handles errors internaly, callers have nothing to do with the return value. Signed-off-by: Yuval Shaia --- hw/rdma/vmw/pvrdma_qp_ops.c | 14 ++ hw/rdma/vmw/pvrdma_qp_ops.h | 4 ++-- 2 files changed, 8 insertions(+), 10 deletions(-) diff --git a/hw/rdma/vmw/pvrdma_qp_ops.c b/hw/rdma/vmw/pvrdma_qp_ops.c index 0565eba981..ce5a60e184 100644 --- a/hw/rdma/vmw/pvrdma_qp_ops.c +++ b/hw/rdma/vmw/pvrdma_qp_ops.c @@ -143,7 +143,7 @@ int pvrdma_qp_ops_init(void) return 0; } -int pvrdma_qp_send(PVRDMADev *dev, uint32_t qp_handle) +void pvrdma_qp_send(PVRDMADev *dev, uint32_t qp_handle) { RdmaRmQP *qp; PvrdmaSqWqe *wqe; @@ -155,7 +155,8 @@ int pvrdma_qp_send(PVRDMADev *dev, uint32_t qp_handle) qp = rdma_rm_get_qp(>rdma_dev_res, qp_handle); if (unlikely(!qp)) { -return -EINVAL; +pr_dbg("Invalid qpn\n"); +return; } ring = (PvrdmaRing *)qp->opaque; @@ -212,11 +213,9 @@ int pvrdma_qp_send(PVRDMADev *dev, uint32_t qp_handle) wqe = pvrdma_ring_next_elem_read(ring); } - -return 0; } -int pvrdma_qp_recv(PVRDMADev *dev, uint32_t qp_handle) +void pvrdma_qp_recv(PVRDMADev *dev, uint32_t qp_handle) { RdmaRmQP *qp; PvrdmaRqWqe *wqe; @@ -226,7 +225,8 @@ int pvrdma_qp_recv(PVRDMADev *dev, uint32_t qp_handle) qp = rdma_rm_get_qp(>rdma_dev_res, qp_handle); if (unlikely(!qp)) { -return -EINVAL; +pr_dbg("Invalid qpn\n"); +return; } ring = &((PvrdmaRing *)qp->opaque)[1]; @@ -262,8 +262,6 @@ int pvrdma_qp_recv(PVRDMADev *dev, uint32_t qp_handle) wqe = pvrdma_ring_next_elem_read(ring); } - -return 0; } void pvrdma_cq_poll(RdmaDeviceResources *dev_res, uint32_t cq_handle) diff --git a/hw/rdma/vmw/pvrdma_qp_ops.h b/hw/rdma/vmw/pvrdma_qp_ops.h index ac46bf7fdf..31cb48ba29 100644 --- a/hw/rdma/vmw/pvrdma_qp_ops.h +++ b/hw/rdma/vmw/pvrdma_qp_ops.h @@ -20,8 +20,8 @@ int pvrdma_qp_ops_init(void); void pvrdma_qp_ops_fini(void); -int pvrdma_qp_send(PVRDMADev *dev, uint32_t qp_handle); -int pvrdma_qp_recv(PVRDMADev *dev, uint32_t qp_handle); +void pvrdma_qp_send(PVRDMADev *dev, uint32_t qp_handle); +void pvrdma_qp_recv(PVRDMADev *dev, uint32_t qp_handle); void pvrdma_cq_poll(RdmaDeviceResources *dev_res, uint32_t cq_handle); #endif Reviewed-by: Marcel Apfelbaum Thanks, Marcel
[Qemu-devel] [PATCH 2/3] hw/rdma: modify struct initialization
Do not initialize structs with {0} since some CLANG versions do not support it. Use memset instead. Signed-off-by: Marcel Apfelbaum --- contrib/rdmacm-mux/main.c | 12 +--- hw/rdma/rdma_backend.c| 16 2 files changed, 21 insertions(+), 7 deletions(-) diff --git a/contrib/rdmacm-mux/main.c b/contrib/rdmacm-mux/main.c index 64676030c5..d01dc76927 100644 --- a/contrib/rdmacm-mux/main.c +++ b/contrib/rdmacm-mux/main.c @@ -350,9 +350,11 @@ static int get_fd(const char *mad, int *fd, __be64 *gid_ifid) static void *umad_recv_thread_func(void *args) { int rc; -RdmaCmMuxMsg msg = {0}; +RdmaCmMuxMsg msg; int fd = -2; +memset(, 0, sizeof(msg)); + msg.hdr.msg_type = RDMACM_MUX_MSG_TYPE_REQ; msg.hdr.op_code = RDMACM_MUX_OP_CODE_MAD; @@ -387,11 +389,13 @@ static void *umad_recv_thread_func(void *args) static int read_and_process(int fd) { int rc; -RdmaCmMuxMsg msg = {0}; +RdmaCmMuxMsg msg; struct umad_hdr *hdr; uint32_t *comm_id = 0; uint16_t attr_id; +memset(, 0, sizeof(msg)); + rc = recv(fd, , sizeof(msg), 0); syslog(LOG_DEBUG, "Socket %d, recv %d\n", fd, rc); @@ -744,7 +748,9 @@ static void signal_handler(int sig, siginfo_t *siginfo, void *context) static int init(void) { int rc; -struct sigaction sig = {0}; +struct sigaction sig; + +memset(, 0, sizeof(sig)); rc = init_listener(); if (rc) { diff --git a/hw/rdma/rdma_backend.c b/hw/rdma/rdma_backend.c index c28bfbd44d..92e95aa640 100644 --- a/hw/rdma/rdma_backend.c +++ b/hw/rdma/rdma_backend.c @@ -190,9 +190,11 @@ static inline int rdmacm_mux_can_process_async(RdmaBackendDev *backend_dev) static int check_mux_op_status(CharBackend *mad_chr_be) { -RdmaCmMuxMsg msg = {0}; +RdmaCmMuxMsg msg; int ret; +memset(, 0, sizeof(msg)); + pr_dbg("Reading response\n"); ret = qemu_chr_fe_read_all(mad_chr_be, (uint8_t *), sizeof(msg)); if (ret != sizeof(msg)) { @@ -387,10 +389,12 @@ static int build_host_sge_array(RdmaDeviceResources *rdma_dev_res, static int mad_send(RdmaBackendDev *backend_dev, uint8_t sgid_idx, union ibv_gid *sgid, struct ibv_sge *sge, uint32_t num_sge) { -RdmaCmMuxMsg msg = {0}; +RdmaCmMuxMsg msg; char *hdr, *data; int ret; +memset(, 0, sizeof(msg)); + pr_dbg("num_sge=%d\n", num_sge); if (num_sge != 2) { @@ -1112,9 +1116,11 @@ int rdma_backend_get_gid_index(RdmaBackendDev *backend_dev, int rdma_backend_add_gid(RdmaBackendDev *backend_dev, const char *ifname, union ibv_gid *gid) { -RdmaCmMuxMsg msg = {0}; +RdmaCmMuxMsg msg; int ret; +memset(, 0, sizeof(msg)); + pr_dbg("0x%llx, 0x%llx\n", (long long unsigned int)be64_to_cpu(gid->global.subnet_prefix), (long long unsigned int)be64_to_cpu(gid->global.interface_id)); @@ -1138,9 +1144,11 @@ int rdma_backend_add_gid(RdmaBackendDev *backend_dev, const char *ifname, int rdma_backend_del_gid(RdmaBackendDev *backend_dev, const char *ifname, union ibv_gid *gid) { -RdmaCmMuxMsg msg = {0}; +RdmaCmMuxMsg msg; int ret; +memset(, 0, sizeof(msg)); + pr_dbg("0x%llx, 0x%llx\n", (long long unsigned int)be64_to_cpu(gid->global.subnet_prefix), (long long unsigned int)be64_to_cpu(gid->global.interface_id)); -- 2.17.1
[Qemu-devel] [PATCH 1/3] contrib/rdmacm-mux: remove Wno-format-truncation flag
The flag is not recognized by some CLANG versions. Add proper constraints in code instead. Signed-off-by: Marcel Apfelbaum --- contrib/rdmacm-mux/Makefile.objs | 2 +- contrib/rdmacm-mux/main.c| 6 -- 2 files changed, 5 insertions(+), 3 deletions(-) diff --git a/contrib/rdmacm-mux/Makefile.objs b/contrib/rdmacm-mux/Makefile.objs index be3eacb6f7..e1ff4fe569 100644 --- a/contrib/rdmacm-mux/Makefile.objs +++ b/contrib/rdmacm-mux/Makefile.objs @@ -1,4 +1,4 @@ ifdef CONFIG_PVRDMA -CFLAGS += -libumad -Wno-format-truncation +CFLAGS += -libumad rdmacm-mux-obj-y = main.o endif diff --git a/contrib/rdmacm-mux/main.c b/contrib/rdmacm-mux/main.c index 835a7f9214..64676030c5 100644 --- a/contrib/rdmacm-mux/main.c +++ b/contrib/rdmacm-mux/main.c @@ -42,6 +42,8 @@ /* The below can be override by command line parameter */ #define UNIX_SOCKET_PATH "/var/run/rdmacm-mux" +/* Has format %s-%s-%d" -- */ +#define SOCKET_PATH_MAX (PATH_MAX - NAME_MAX - sizeof(int) - 2) #define RDMA_PORT_NUM 1 typedef struct RdmaCmServerArgs { @@ -95,7 +97,7 @@ static void help(const char *progname) static void parse_args(int argc, char *argv[]) { int c; -char unix_socket_path[PATH_MAX]; +char unix_socket_path[SOCKET_PATH_MAX]; strcpy(server.args.rdma_dev_name, ""); strcpy(unix_socket_path, UNIX_SOCKET_PATH); @@ -113,7 +115,7 @@ static void parse_args(int argc, char *argv[]) case 's': /* This is temporary, final name will build below */ -strncpy(unix_socket_path, optarg, PATH_MAX); +strncpy(unix_socket_path, optarg, SOCKET_PATH_MAX); break; case 'p': -- 2.17.1
[Qemu-devel] [PATCH 3/3] contrib/rdmacm-mux: fix clang compilation
Fix Commit a5d2f6f877 (contrib/rdmacm-mux: Add implementation of RDMA User MAD multiplexer). The above commit introduces a new contrib target, adding a global dependency to libumad library in case pvrdma configuration option is enabled. Clang forbids it: clang-6.0: error: -libumad: 'linker' input unused [-Werror,-Wunused-command-line-argument] Fix by limiting the scope to the rdmacm-mux target itself. Reported-by: Cornelia Huck Signed-off-by: Marcel Apfelbaum --- Makefile | 2 ++ contrib/rdmacm-mux/Makefile.objs | 1 - 2 files changed, 2 insertions(+), 1 deletion(-) diff --git a/Makefile b/Makefile index a9ac16d94e..31e87e0c2d 100644 --- a/Makefile +++ b/Makefile @@ -580,6 +580,8 @@ vhost-user-scsi$(EXESUF): $(vhost-user-scsi-obj-y) libvhost-user.a $(call LINK, $^) vhost-user-blk$(EXESUF): $(vhost-user-blk-obj-y) libvhost-user.a $(call LINK, $^) + +rdmacm-mux$(EXESUF): LIBS += "-libumad" rdmacm-mux$(EXESUF): $(rdmacm-mux-obj-y) $(COMMON_LDADDS) $(call LINK, $^) diff --git a/contrib/rdmacm-mux/Makefile.objs b/contrib/rdmacm-mux/Makefile.objs index e1ff4fe569..3df744af89 100644 --- a/contrib/rdmacm-mux/Makefile.objs +++ b/contrib/rdmacm-mux/Makefile.objs @@ -1,4 +1,3 @@ ifdef CONFIG_PVRDMA -CFLAGS += -libumad rdmacm-mux-obj-y = main.o endif -- 2.17.1
[Qemu-devel] [PATCH 0/3] contrib/rdmacm-mux: fix clang compilation
Fix Commit a5d2f6f877 (contrib/rdmacm-mux: Add implementation of RDMA User MAD multiplexer). The above commit introduces a new contrib target, adding a global dependency to libumad library in case pvrdma configuration option is enabled. Clang forbids it: clang-6.0: error: -libumad: 'linker' input unused [-Werror,-Wunused-command-line-argument] Fix by limiting the scope to the rdmacm-mux target itself. Fix related clang errors while at it. Marcel Apfelbaum (3): contrib/rdmacm-mux: remove Wno-format-truncation flag hw/rdma: modify struct initialization contrib/rdmacm-mux: fix clang compilation Makefile | 2 ++ contrib/rdmacm-mux/Makefile.objs | 1 - contrib/rdmacm-mux/main.c| 18 +- hw/rdma/rdma_backend.c | 16 4 files changed, 27 insertions(+), 10 deletions(-) -- 2.17.1
[Qemu-devel] QEMU 3.1 - Getting Instruction Traces
Dear Developers, I am an undergraduate student trying to generate *a trace of all instructions executed by the QEMU ARM emulator when it runs a given binary*. I have been using the* following command* (from an out-of-tree build from source code): arm-linux-user/qemu-arm -d in_asm ~/Desktop/hello (hello is the name of my binary cross-compiled for ARM using gcc-arm-none-eabi on an x86 host machine). By doing this, I do get a list of translation blocks that are executed by the emulator, however I was told by a colleague that this trace/log is incomplete because of the following reason: the translation blocks already in the translation cache are not translated again, and hence are not logged to the screen by QEMU. Hence, I am now stuck and r*equire a way to generate a complete instruction trace of all instructions that are executed by the ARM emulator for a given binary*. Has anyone done the same earlier and can guide me on how to get this trace? Thank you and best regards, Dhruv
Re: [Qemu-devel] [PATCH PULL 02/31] contrib/rdmacm-mux: Add implementation of RDMA User MAD multiplexer
On 1/8/19 11:18 AM, Cornelia Huck wrote: On Tue, 8 Jan 2019 10:41:38 +0200 Marcel Apfelbaum wrote: On 1/7/19 7:54 PM, Cornelia Huck wrote: On Mon, 7 Jan 2019 19:28:10 +0200 Marcel Apfelbaum wrote: On 1/3/19 12:34 PM, Cornelia Huck wrote: clang-6.0: error: -libumad: 'linker' input unused [-Werror,-Wunused-command-line-argument] Is really strange, the rdma-mux is part of the contrib directory and is not even compiled by default. Explicitly passing --disable-pvrdma to configure disables the offending code. Let me know if you need more information. Can you please send the exact steps you are using to configure and compile QEMU, are you compiling on a x86 machine? Yes, this is on my laptop (up-to-date F28). I can reproduce on current master, configured with ../configure --target-list="s390x-softmmu s390x-linux-user i386-softmmu x86_64-softmmu cris-softmmu arm-softmmu ppc64-softmmu" --enable-linux-aio --enable-virtfs --enable-trace-backends=simple --cc=clang --enable-rdma --enable-pvrdma It compiles if I pass --disable-pvrdma instead. It also compiles if I drop --cc=clang. I'm building in a build subdirectory; I have already tried purging it before rebuilding, without luck. clang --version says clang version 6.0.1 (tags/RELEASE_601/final) Target: x86_64-unknown-linux-gnu Thread model: posix InstalledDir: /usr/bin Thanks for the information, I will look into it. Marcel Thanks. Another data point: I see the same on a Fedora 29 on s390x, so this does not seem to be a broken local setup (unless I managed to mess up both systems in the same way...) clang --version says clang version 7.0.0 (Fedora 7.0.0-2.fc29) Target: s390x-ibm-linux Thread model: posix InstalledDir: /usr/bin Hi Cornelia, I reproduced the issue and found the root cause: contrib/rdmacm-mux/Makefile.objs: ifdef CONFIG_PVRDMA CFLAGS += -libumad -Wno-format-truncation rdmacm-mux-obj-y = main.o endif The above file adds libumad dependency for all project targets even if not needed. Gcc allows that while clang forbids it. (is strange it passed patchew tests) I will come up with a fix soon. Thanks, Marcel
[Qemu-devel] [Bug 1811499] [NEW] qemu/net/colo-compare.c:288: possible pointless code duplication ?
Public bug reported: qemu/net/colo-compare.c:288] -> [qemu/net/colo-compare.c:296]: (style) The if condition is the same as the previous if condition Source code is if (ppkt->tcp_seq == spkt->tcp_seq && ppkt->seq_end == spkt->seq_end) { if (colo_compare_packet_payload(ppkt, spkt, ppkt->header_size, spkt->header_size, ppkt->payload_size)) { *mark = COLO_COMPARE_FREE_SECONDARY | COLO_COMPARE_FREE_PRIMARY; return true; } } if (ppkt->tcp_seq == spkt->tcp_seq && ppkt->seq_end == spkt->seq_end) { if (colo_compare_packet_payload(ppkt, spkt, ppkt->header_size, spkt->header_size, ppkt->payload_size)) { *mark = COLO_COMPARE_FREE_SECONDARY | COLO_COMPARE_FREE_PRIMARY; return true; } } Maybe the second block was supposed to be different ? ** Affects: qemu Importance: Undecided Status: New -- You received this bug notification because you are a member of qemu- devel-ml, which is subscribed to QEMU. https://bugs.launchpad.net/bugs/1811499 Title: qemu/net/colo-compare.c:288: possible pointless code duplication ? Status in QEMU: New Bug description: qemu/net/colo-compare.c:288] -> [qemu/net/colo-compare.c:296]: (style) The if condition is the same as the previous if condition Source code is if (ppkt->tcp_seq == spkt->tcp_seq && ppkt->seq_end == spkt->seq_end) { if (colo_compare_packet_payload(ppkt, spkt, ppkt->header_size, spkt->header_size, ppkt->payload_size)) { *mark = COLO_COMPARE_FREE_SECONDARY | COLO_COMPARE_FREE_PRIMARY; return true; } } if (ppkt->tcp_seq == spkt->tcp_seq && ppkt->seq_end == spkt->seq_end) { if (colo_compare_packet_payload(ppkt, spkt, ppkt->header_size, spkt->header_size, ppkt->payload_size)) { *mark = COLO_COMPARE_FREE_SECONDARY | COLO_COMPARE_FREE_PRIMARY; return true; } } Maybe the second block was supposed to be different ? To manage notifications about this bug go to: https://bugs.launchpad.net/qemu/+bug/1811499/+subscriptions
[Qemu-devel] [PATCH] KVM: MMU: fast cleanup D bit based on fast write protect
From: Zhuang Yanying Recently I tested live-migration with large-memory guests, find vcpu may hang for a long time while starting migration, such as 9s for 2048G(linux-4.20.1+qemu-3.1.0). The reason is memory_global_dirty_log_start() taking too long, and the vcpu is waiting for BQL. The page-by-page D bit clearup is the main time consumption. I think that the idea of "KVM: MMU: fast write protect" by xiaoguangrong, especially the function kvm_mmu_write_protect_all_pages(), is very helpful. After a little modifcation, on his patch, can solve this problem, 9s to 0.5s. At the begining of live migration, write protection is only applied to the top-level SPTE. Then the write from vm trigger the EPT violation, with for_each_shadow_entry write protection is performed at dirct_map. Finally the Dirty bit of the target page(at level 1 page table) is cleared, and the dirty page tracking is started. Of coure, the page where GPA is located is marked dirty when mmu_set_spte. A similar implementation on xen, just emt instead of write protection. What do you think about this solution? --- mmu.c | 5 - vmx.c | 3 +-- 2 files changed, 5 insertions(+), 3 deletions(-) diff --git a/mmu.c b/mmu.c index b079d74..f49d316 100755 --- a/mmu.c +++ b/mmu.c @@ -3210,7 +3210,10 @@ static bool mmu_load_shadow_page(struct kvm *kvm, struct kvm_mmu_page *sp) break; if (is_last_spte(spte, sp->role.level)) { - flush |= spte_write_protect(sptep, false); + if (sp->role.level == PT_PAGE_TABLE_LEVEL) + flush |= spte_clear_dirty(sptep); + else + flush |= spte_write_protect(sptep, false); continue; } diff --git a/vmx.c b/vmx.c index 95784bc..7ec717f 100755 --- a/vmx.c +++ b/vmx.c @@ -14421,8 +14421,7 @@ static void vmx_sched_in(struct kvm_vcpu *vcpu, int cpu) static void vmx_slot_enable_log_dirty(struct kvm *kvm, struct kvm_memory_slot *slot) { - kvm_mmu_slot_leaf_clear_dirty(kvm, slot); - kvm_mmu_slot_largepage_remove_write_access(kvm, slot); + kvm_mmu_write_protect_all_pages(kvm, true); } static void vmx_slot_disable_log_dirty(struct kvm *kvm, -- 1.8.3.1
Re: [Qemu-devel] [QEMU-devel][PATCH v4 0/2] Fix concurrent aio_poll/set_fd_handler.
On Thu, Dec 20, 2018 at 04:20:28PM +0100, remy.n...@blade-group.com wrote: > From: Remy Noel > > It is possible for an io_poll/read/write callback to be concurrently executed > along > with an aio_set_fd_handlers. This can cause all sorts of problems, like > a NULL callback or a bad opaque pointer. > > V2: > * Do not use RCU anymore as it inccurs a performance loss > V3: > * Don't drop revents when a handler is modified [Stefan] > V4: > * Unregister fd from ctx epoll when removing fd_handler [Paolo] > > Remy Noel (2): > aio-posix: Unregister fd from ctx epoll when removing fd_handler. > aio-posix: Fix concurrent aio_poll/set_fd_handler. > > util/aio-posix.c | 90 +--- > util/aio-win32.c | 67 --- > 2 files changed, 84 insertions(+), 73 deletions(-) > > -- > 2.19.2 > Thanks, applied to my block tree: https://github.com/stefanha/qemu/commits/block Stefan signature.asc Description: PGP signature