Re: [Qemu-block] [Qemu-devel] [Xen-devel] Question about xen disk unplug support for ahci missed in qemu
On 10/16/15 04:38, Kevin O'Connor wrote: > On Fri, Oct 16, 2015 at 01:10:54AM +0200, Laszlo Ersek wrote: >> On 10/14/15 13:27, Ian Campbell wrote: >>> On Wed, 2015-10-14 at 12:06 +0100, Stefano Stabellini wrote: > Can't you just teach SeaBIOS how to deal with your PV disks and then > only add that to your VM and forget about IDE/AHCI? I mean, that's how > it's done for virtio-blk, and it doesn't involve any insanities like > ripping out non-hotpluggable devices. Teaching SeaBIOS to deal with PV disks can be done, in fact we already support PV disks in OVMF. It is possible to boot Windows with OVMF without any IDE disks (patch pending for libxl to create a VM without emulated IDE disks). >>> >>> One stumbling block in the past has been how to know when the PV drivers in >>> the BIOS are no longer required, such that the ring can be torn down and/or >>> the connection etc handed over to the OS driver. > [...] >>> AFAIK the BIOS interfaces do not have anything as reliable as that. >>> >>> How does virtio deal with this in the BIOS case? >> >> It doesn't, as far as I can tell. >> >> I don't think it has to, though! On a BIOS box, you can always boot DOS, >> or another operating system that continues to use the BIOS interfaces >> forever. (Same as if you never call ExitBootServices() in UEFI.) >> >> Given that no starter pistol gets fired between the firmware and the OS >> on such a platform, they must always respect each other. I guess this >> could occur through the E820 map, or some such. > > One can use the "ACPI enable" SMI event to detect this if they really > wanted to. In SeaBIOS one could do this from > src/fw/smm.c:handle_smi() - however, no other drivers need this > notification today and it would be a bit ugly to have to handle it > from an SMI. (Assuming Xen were to support SMIs.) 
>
>> No clue in what kind of E820 memory SeaBIOS allocates the virtio rings,
>> but I guess the Linux kernel stays away from those areas until it's past
>> device probing and binding.
>
> In SeaBIOS, the virtio memory is allocated from reserved memory.

Perfect! That gives Xen drivers a precedent to do the same.

> (See
> the memalign_high() call in src/hw/virtio-pci.c - the "high" memory
> zone is taken from reserved memory:
> http://seabios.org/Memory_Model#Memory_available_during_initialization
> )
>
> What's the reason for the "stumbling block" that requires the BIOS to
> tear down the Xen ring prior to the OS being able to replace it? The
> BIOS disk calls are all synchronous, so the ring won't be active when
> the OS brings up its own ring.

Yes, that's an argument that works well in practice. However...

> Is there some low-level interaction
> that prevents the OS from just resetting the ring prior to enabling
> it?

the assumption was that the ring would be placed into normal memory. If
GRUB or the kernel overwrote the memory (reallocating the same pages for
completely unrelated purposes) that used to contain the ring while
SeaBIOS was serving requests, the hypervisor would be allowed to notice
and act upon writes to those pages *without* any explicit "kick"
(= guest-to-host notification).

The hypervisor is allowed to look at the ring any time it wishes, so
guest code uses barriers while populating the ring, and kicks the
hypervisor "just in case it's not looking right now".

But if the firmware's ring is in reserved memory, then the OS will stay
away forever. That's great -- it answers the question for virtio, and
should also guide a Xen PV driver implementation.

Thanks!
Laszlo

>
> -Kevin
>
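To illustrate the "barriers, then kick" point above, here is a minimal sketch in plain C11. It is not SeaBIOS, virtio or Xen code: `struct demo_ring`, `ring_submit()` and `kick_host()` are hypothetical names, the ring has no fullness check, and the host side is not modelled. It only shows why a host that reads the ring at any time could act on a stray write to the producer index if those pages were reused.

```c
#include <stdatomic.h>
#include <stdint.h>

#define RING_SIZE 8u  /* hypothetical size; a power of two */

struct demo_ring {
    uint64_t slot[RING_SIZE];   /* request payloads */
    _Atomic uint32_t prod;      /* producer index; the host may read it at any time */
};

static unsigned kicks;          /* counts notifications in this model */

/* Stand-in for the guest-to-host notification (hypothetical). */
static void kick_host(void) { kicks++; }

/* Fill the slot first, then publish the index with release ordering, so the
 * host can never observe the new index before the payload is in place. */
static void ring_submit(struct demo_ring *r, uint64_t req)
{
    uint32_t p = atomic_load_explicit(&r->prod, memory_order_relaxed);

    r->slot[p % RING_SIZE] = req;
    atomic_store_explicit(&r->prod, p + 1, memory_order_release);
    kick_host();   /* only a hint: the host may already have looked */
}
```

The kick is advisory in this model; correctness rests on the ordering of the two stores, which is why the location of the ring memory matters even when no request is in flight.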
Re: [Qemu-block] [Qemu-devel] [Xen-devel] Question about xen disk unplug support for ahci missed in qemu
On Fri, 16 Oct 2015, Laszlo Ersek wrote: > On 10/16/15 11:06, Stefano Stabellini wrote: > > On Thu, 15 Oct 2015, Kevin O'Connor wrote: > >> On Fri, Oct 16, 2015 at 01:10:54AM +0200, Laszlo Ersek wrote: > >>> On 10/14/15 13:27, Ian Campbell wrote: > On Wed, 2015-10-14 at 12:06 +0100, Stefano Stabellini wrote: > >> Can't you just teach SeaBIOS how to deal with your PV disks and then > >> only add that to your VM and forget about IDE/AHCI? I mean, that's how > >> it's done for virtio-blk, and it doesn't involve any insanities like > >> ripping out non-hotpluggable devices. > > > > Teaching SeaBIOS to deal with PV disks can be done, in fact we already > > support PV disks in OVMF. It is possible to boot Windows with OVMF > > without any IDE disks (patch pending for libxl to create a VM without > > emulated IDE disks). > > One stumbling block in the past has been how to know when the PV drivers > in > the BIOS are no longer required, such that the ring can be torn down > and/or > the connection etc handed over to the OS driver. > >> [...] > AFAIK the BIOS interfaces do not have anything as reliable as that. > > How does virtio deal with this in the BIOS case? > >>> > >>> It doesn't, as far as I can tell. > >>> > >>> I don't think it has to, though! On a BIOS box, you can always boot DOS, > >>> or another operating system that continues to use the BIOS interfaces > >>> forever. (Same as if you never call ExitBootServices() in UEFI.) > >>> > >>> Given that no starter pistol gets fired between the firmware and the OS > >>> on such a platform, they must always respect each other. I guess this > >>> could occur through the E820 map, or some such. > >> > >> One can use the "ACPI enable" SMI event to detect this if they really > >> wanted to. In SeaBIOS one could do this from > >> src/fw/smm.c:handle_smi() - however, no other drivers need this > >> notification today and it would be a bit ugly to have to handle it > >> from an SMI. (Assuming Xen were to support SMIs.) 
> >>
> >>> No clue in what kind of E820 memory SeaBIOS allocates the virtio rings,
> >>> but I guess the Linux kernel stays away from those areas until it's past
> >>> device probing and binding.
> >>
> >> In SeaBIOS, the virtio memory is allocated from reserved memory. (See
> >> the memalign_high() call in src/hw/virtio-pci.c - the "high" memory
> >> zone is taken from reserved memory:
> >> http://seabios.org/Memory_Model#Memory_available_during_initialization
> >> )
> >>
> >> What's the reason for the "stumbling block" that requires the BIOS to
> >> tear down the Xen ring prior to the OS being able to replace it? The
> >> BIOS disk calls are all synchronous, so the ring won't be active when
> >> the OS brings up its own ring. Is there some low-level interaction
> >> that prevents the OS from just resetting the ring prior to enabling
> >> it?
> >
> > Xen only exports one PV disk interface for each disk to the guest, and
> > each PV interface only supports one frontend -- only SeaBIOS or the OS
> > can be connected to one PV disk, not both. In the case of OVMF, we
> > handle that by disconnecting the PV frontend in OVMF when
> > ExitBootServices is called, so that the OS driver can reconnect later.
>
> Does the XenBus protocol support a device reset operation, regardless of
> what state the device is currently in? (I don't remember all the state
> transitions any longer, sorry.)

The PV block protocol doesn't, unfortunately. At least the block backend
in QEMU doesn't support it.
Re: [Qemu-block] [Qemu-devel] [Xen-devel] Question about xen disk unplug support for ahci missed in qemu
On 10/16/15 11:06, Stefano Stabellini wrote: > On Thu, 15 Oct 2015, Kevin O'Connor wrote: >> On Fri, Oct 16, 2015 at 01:10:54AM +0200, Laszlo Ersek wrote: >>> On 10/14/15 13:27, Ian Campbell wrote: On Wed, 2015-10-14 at 12:06 +0100, Stefano Stabellini wrote: >> Can't you just teach SeaBIOS how to deal with your PV disks and then >> only add that to your VM and forget about IDE/AHCI? I mean, that's how >> it's done for virtio-blk, and it doesn't involve any insanities like >> ripping out non-hotpluggable devices. > > Teaching SeaBIOS to deal with PV disks can be done, in fact we already > support PV disks in OVMF. It is possible to boot Windows with OVMF > without any IDE disks (patch pending for libxl to create a VM without > emulated IDE disks). One stumbling block in the past has been how to know when the PV drivers in the BIOS are no longer required, such that the ring can be torn down and/or the connection etc handed over to the OS driver. >> [...] AFAIK the BIOS interfaces do not have anything as reliable as that. How does virtio deal with this in the BIOS case? >>> >>> It doesn't, as far as I can tell. >>> >>> I don't think it has to, though! On a BIOS box, you can always boot DOS, >>> or another operating system that continues to use the BIOS interfaces >>> forever. (Same as if you never call ExitBootServices() in UEFI.) >>> >>> Given that no starter pistol gets fired between the firmware and the OS >>> on such a platform, they must always respect each other. I guess this >>> could occur through the E820 map, or some such. >> >> One can use the "ACPI enable" SMI event to detect this if they really >> wanted to. In SeaBIOS one could do this from >> src/fw/smm.c:handle_smi() - however, no other drivers need this >> notification today and it would be a bit ugly to have to handle it >> from an SMI. (Assuming Xen were to support SMIs.) 
>>
>>> No clue in what kind of E820 memory SeaBIOS allocates the virtio rings,
>>> but I guess the Linux kernel stays away from those areas until it's past
>>> device probing and binding.
>>
>> In SeaBIOS, the virtio memory is allocated from reserved memory. (See
>> the memalign_high() call in src/hw/virtio-pci.c - the "high" memory
>> zone is taken from reserved memory:
>> http://seabios.org/Memory_Model#Memory_available_during_initialization
>> )
>>
>> What's the reason for the "stumbling block" that requires the BIOS to
>> tear down the Xen ring prior to the OS being able to replace it? The
>> BIOS disk calls are all synchronous, so the ring won't be active when
>> the OS brings up its own ring. Is there some low-level interaction
>> that prevents the OS from just resetting the ring prior to enabling
>> it?
>
> Xen only exports one PV disk interface for each disk to the guest, and
> each PV interface only supports one frontend -- only SeaBIOS or the OS
> can be connected to one PV disk, not both. In the case of OVMF, we
> handle that by disconnecting the PV frontend in OVMF when
> ExitBootServices is called, so that the OS driver can reconnect later.

Does the XenBus protocol support a device reset operation, regardless of
what state the device is currently in? (I don't remember all the state
transitions any longer, sorry.)

Thanks
Laszlo
Re: [Qemu-block] [Qemu-devel] [Xen-devel] Question about xen disk unplug support for ahci missed in qemu
On Thu, 15 Oct 2015, Kevin O'Connor wrote: > On Fri, Oct 16, 2015 at 01:10:54AM +0200, Laszlo Ersek wrote: > > On 10/14/15 13:27, Ian Campbell wrote: > > > On Wed, 2015-10-14 at 12:06 +0100, Stefano Stabellini wrote: > > >>> Can't you just teach SeaBIOS how to deal with your PV disks and then > > >>> only add that to your VM and forget about IDE/AHCI? I mean, that's how > > >>> it's done for virtio-blk, and it doesn't involve any insanities like > > >>> ripping out non-hotpluggable devices. > > >> > > >> Teaching SeaBIOS to deal with PV disks can be done, in fact we already > > >> support PV disks in OVMF. It is possible to boot Windows with OVMF > > >> without any IDE disks (patch pending for libxl to create a VM without > > >> emulated IDE disks). > > > > > > One stumbling block in the past has been how to know when the PV drivers > > > in > > > the BIOS are no longer required, such that the ring can be torn down > > > and/or > > > the connection etc handed over to the OS driver. > [...] > > > AFAIK the BIOS interfaces do not have anything as reliable as that. > > > > > > How does virtio deal with this in the BIOS case? > > > > It doesn't, as far as I can tell. > > > > I don't think it has to, though! On a BIOS box, you can always boot DOS, > > or another operating system that continues to use the BIOS interfaces > > forever. (Same as if you never call ExitBootServices() in UEFI.) > > > > Given that no starter pistol gets fired between the firmware and the OS > > on such a platform, they must always respect each other. I guess this > > could occur through the E820 map, or some such. > > One can use the "ACPI enable" SMI event to detect this if they really > wanted to. In SeaBIOS one could do this from > src/fw/smm.c:handle_smi() - however, no other drivers need this > notification today and it would be a bit ugly to have to handle it > from an SMI. (Assuming Xen were to support SMIs.) 
>
> > No clue in what kind of E820 memory SeaBIOS allocates the virtio rings,
> > but I guess the Linux kernel stays away from those areas until it's past
> > device probing and binding.
>
> In SeaBIOS, the virtio memory is allocated from reserved memory. (See
> the memalign_high() call in src/hw/virtio-pci.c - the "high" memory
> zone is taken from reserved memory:
> http://seabios.org/Memory_Model#Memory_available_during_initialization
> )
>
> What's the reason for the "stumbling block" that requires the BIOS to
> tear down the Xen ring prior to the OS being able to replace it? The
> BIOS disk calls are all synchronous, so the ring won't be active when
> the OS brings up its own ring. Is there some low-level interaction
> that prevents the OS from just resetting the ring prior to enabling
> it?

Xen only exports one PV disk interface for each disk to the guest, and
each PV interface only supports one frontend -- only SeaBIOS or the OS
can be connected to one PV disk, not both. In the case of OVMF, we
handle that by disconnecting the PV frontend in OVMF when
ExitBootServices is called, so that the OS driver can reconnect later.
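The "one frontend per PV disk" constraint can be modelled in a few lines of C. This is only a sketch: `struct pv_disk`, `pv_connect()` and `pv_disconnect()` are hypothetical and do not correspond to any Xen or QEMU API. The three state values are the ones I recall from Xen's public io/xenbus.h; everything else is an assumption made for illustration.

```c
/* Subset of the XenBus states, values as recalled from Xen's io/xenbus.h. */
enum xenbus_state {
    XenbusStateInitialising = 1,
    XenbusStateConnected    = 4,
    XenbusStateClosed       = 6,
};

enum owner { OWNER_NONE, OWNER_FIRMWARE, OWNER_OS };

/* Hypothetical model of one PV disk backend with a single frontend slot. */
struct pv_disk {
    enum xenbus_state state;
    enum owner        owner;
};

/* Returns 0 on success, -1 if another frontend is still connected. */
static int pv_connect(struct pv_disk *d, enum owner who)
{
    if (d->state == XenbusStateConnected)
        return -1;
    d->owner = who;
    d->state = XenbusStateConnected;
    return 0;
}

/* Only the current owner may disconnect. This models what OVMF does at
 * ExitBootServices() so that the OS driver can reconnect later. */
static int pv_disconnect(struct pv_disk *d, enum owner who)
{
    if (d->state != XenbusStateConnected || d->owner != who)
        return -1;
    d->owner = OWNER_NONE;
    d->state = XenbusStateClosed;
    return 0;
}
```

In this model an OS connect attempt fails for as long as the firmware frontend is connected, which is the situation a BIOS without an ExitBootServices() equivalent would leave behind.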
Re: [Qemu-block] [Qemu-devel] [Xen-devel] Question about xen disk unplug support for ahci missed in qemu
On Fri, 2015-10-16 at 10:06 +0100, Stefano Stabellini wrote:
> > What's the reason for the "stumbling block" that requires the BIOS to
> > tear down the Xen ring prior to the OS being able to replace it? The
> > BIOS disk calls are all synchronous, so the ring won't be active when
> > the OS brings up its own ring. Is there some low-level interaction
> > that prevents the OS from just resetting the ring prior to enabling
> > it?
>
> Xen only exports one PV disk interface for each disk to the guest, and
> each PV interface only supports one frontend -- only SeaBIOS or the OS
> can be connected to one PV disk, not both.

Which I think is just another way of saying that the Xen PV protocol
currently lacks an explicit requirement for the OS to reset the device
(or indeed the general PV infrastructure, grant tables etc) before use.

Retrofitting that requirement is of course a little tricky. The unplug
protocol might be extensible enough though. IIRC it does include
provisions for the OS to specify a version and then reject the unplug,
so uprevving that to include a reset requirement _might_ be possible.
At which point it can at least be made a config option which can be
switched on for new enough guests.

i.e. if the guest is configured to use PV drivers from SeaBIOS the
unplug protocol would reject the attempt to unplug the (non-existent)
IDE devices and the guest therefore should fail to bind to the PV
devices, while a newer guest which knows it has to do a reset would
declare itself to be newer and succeed in the unplug.

(NB: details of the protocol are sketchy in my memory, and the above may
need actual thought applied to make it practical, but you get the gist I
hope).

Then you are just into some sort of multiyear transition/deprecation
sequence before you make it the default.

> In the case of OVMF, we
> handle that by disconnecting the PV frontend in OVMF when
> ExitBootServices is called, so that the OS driver can reconnect later.
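The versioning idea above boils down to a small policy decision on the host side. The following is a purely hypothetical sketch of that decision: the version number, the `firmware_uses_pv` config flag and `unplug_accepted()` are all invented for illustration and are not part of the existing unplug protocol.

```c
#include <stdbool.h>

/* Hypothetical: first unplug-protocol version whose guests promise to reset
 * the PV device before use. No such version is defined in this thread. */
#define UNPLUG_VERSION_WITH_RESET 2u

/* Decide whether to accept a guest's unplug request.
 * firmware_uses_pv: the guest config lets SeaBIOS drive the PV disk.
 * guest_version:    the version the OS driver declares during unplug. */
static bool unplug_accepted(bool firmware_uses_pv, unsigned guest_version)
{
    if (!firmware_uses_pv)
        return true;    /* legacy setup: firmware never touched the PV disk */
    return guest_version >= UNPLUG_VERSION_WITH_RESET;
}
```

With the option off, every guest behaves as today. With it on, an older guest has its unplug rejected and so should not bind to the PV devices, while a guest declaring the newer version is let through.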
Re: [Qemu-block] [Qemu-devel] [Xen-devel] Question about xen disk unplug support for ahci missed in qemu
On Fri, Oct 16, 2015 at 10:06:48AM +0100, Stefano Stabellini wrote:
> On Thu, 15 Oct 2015, Kevin O'Connor wrote:
> > What's the reason for the "stumbling block" that requires the BIOS to
> > tear down the Xen ring prior to the OS being able to replace it? The
> > BIOS disk calls are all synchronous, so the ring won't be active when
> > the OS brings up its own ring. Is there some low-level interaction
> > that prevents the OS from just resetting the ring prior to enabling
> > it?
>
> Xen only exports one PV disk interface for each disk to the guest, and
> each PV interface only supports one frontend -- only SeaBIOS or the OS
> can be connected to one PV disk, not both. In the case of OVMF, we
> handle that by disconnecting the PV frontend in OVMF when
> ExitBootServices is called, so that the OS driver can reconnect later.

Well, there isn't a requirement for both SeaBIOS and the OS to be
connected at the same time - it's fine for the OS to replace SeaBIOS.
With the hardware I'm familiar with (eg, usb, ahci, virtio) the OS just
ends up replacing SeaBIOS' DMA rings when it configures its own. I
guess something in the low-level interface of Xen makes that not work.

Is plugging/unplugging very high overhead? Since the SeaBIOS disk
interface is fully synchronous, in theory one could have it plug/unplug
on every read request.

-Kevin
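A rough sketch of the "plug/unplug on every read request" idea, in plain C. Nothing here is SeaBIOS code: `pv_connect()`, `pv_disconnect()` and `bios_read_sector()` are hypothetical stand-ins that only count calls, and the disk is a small in-memory array so that the sketch runs on its own.

```c
#include <stddef.h>
#include <string.h>

#define SECTOR_SIZE 512

/* Hypothetical stand-ins for the frontend handshake; they only count calls. */
static unsigned connects, disconnects;
static int connected;

static int pv_connect(void)
{
    if (connected)
        return -1;
    connected = 1;
    connects++;
    return 0;
}

static void pv_disconnect(void)
{
    connected = 0;
    disconnects++;
}

/* Fake backing store so the sketch is runnable. */
static unsigned char disk[4 * SECTOR_SIZE];

/* One synchronous BIOS-style read: connect, transfer, disconnect.
 * Between calls the frontend is never connected. */
static int bios_read_sector(unsigned lba, void *buf)
{
    if (lba >= sizeof(disk) / SECTOR_SIZE)
        return -1;
    if (pv_connect() != 0)
        return -1;
    memcpy(buf, &disk[(size_t)lba * SECTOR_SIZE], SECTOR_SIZE);
    pv_disconnect();
    return 0;
}
```

The cost is one full handshake per request, which is exactly the overhead the question above asks about; the benefit is that the OS never finds a connected firmware frontend.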
Re: [Qemu-block] [Qemu-devel] [Xen-devel] Question about xen disk unplug support for ahci missed in qemu
CC'ing Kevin O'Connor

On 10/14/15 13:27, Ian Campbell wrote:
> On Wed, 2015-10-14 at 12:06 +0100, Stefano Stabellini wrote:
>>> Can't you just teach SeaBIOS how to deal with your PV disks and then
>>> only add that to your VM and forget about IDE/AHCI? I mean, that's how
>>> it's done for virtio-blk, and it doesn't involve any insanities like
>>> ripping out non-hotpluggable devices.
>>
>> Teaching SeaBIOS to deal with PV disks can be done, in fact we already
>> support PV disks in OVMF. It is possible to boot Windows with OVMF
>> without any IDE disks (patch pending for libxl to create a VM without
>> emulated IDE disks).
>
> One stumbling block in the past has been how to know when the PV drivers in
> the BIOS are no longer required, such that the ring can be torn down and/or
> the connection etc handed over to the OS driver.
>
> I think we deal with this in OVMF using ExitBootServices? (TBH I'm not sure
> how).

Search "XenBusDxe/XenBusDxe.c" in edk2 for
"EVT_SIGNAL_EXIT_BOOT_SERVICES".

TBH, the code in NotifyExitBoot() doesn't seem valid. If you check the
UEFI spec (for example, v2.5, but the requirement I'm about to quote is
very old), in the specification of EFI_BOOT_SERVICES.CreateEvent() you
find:

  EVT_SIGNAL_EXIT_BOOT_SERVICES

    This event is to be notified by the system when ExitBootServices()
    is invoked. This event is of type EVT_NOTIFY_SIGNAL and should not
    be combined with any other event types. The notification function
    for this event is not allowed to use the Memory Allocation Services,
    or call any functions that use the Memory Allocation Services and
    must only call functions that are known not to use Memory Allocation
    Services, because these services modify the current memory map. The
    notification function must not depend on timer events since timer
    services will be deactivated before any notification functions are
    called.

NotifyExitBoot() in "XenBusDxe/XenBusDxe.c" calls the
DisconnectController() boot service. That in turn leads to calls to
EFI_DRIVER_BINDING_PROTOCOL.Stop() functions (speaking generally), which
inevitably free memory as part of unbinding the device, thereby breaking
the above requirement.

The right solution is the following:

- when a driver binds a device (a "handle"), a piece of the resources
  allocated for that binding should be a new event, to be signaled at
  ExitBootServices() time. The handler function can be shared by all
  such devices. The context passed to the handler should be the
  (driver-specific) structure that represents the binding and the state
  of the device in general.

- When the driver unbinds the device, the event should be closed. This
  will automatically unregister the callback as well.

- Now, when the callback is entered at all, you can be sure that the
  binding still exists. In this case, you should probe into the various
  fields of the context (the device state, practically), to figure out
  if this device "lives" or is dormant. For simpler devices, the answer
  is always "alive", but some devices could have configuration states
  where they are bound yet not configured (using no hw resources etc).

- In case the device is alive, the action to take is to make it abort
  any in-flight transfers or other operations, and re-set / deconfigure
  it *without* touching any memory allocations.

You can see this in the virtio-net driver in OVMF. In the
"OvmfPkg/VirtioNetDxe" directory, see the "TechNotes.txt",
"DriverBinding.c" and "Events.c" files. The callback function is
VirtioNetExitBoot(), and the event registration / deregistration happens
in:

  VirtioNetDriverBindingStart()
    VirtioNetSnpPopulate()
      gBS->CreateEvent(EVT_SIGNAL_EXIT_BOOT_SERVICES)

vs.

  VirtioNetDriverBindingStop()
    VirtioNetSnpEvacuate()
      gBS->CloseEvent()

What VirtioNetExitBoot() does is very simple: resetting the virtio
device is a small action, and it covers the responsibilities.

I'll admit that the virtio-scsi and virtio-block drivers play a bit
dirty here. (I've known this for a long time, but been silent about it.)
They should have similar callbacks, but don't. (In theory, all devices
that are bound & alive at ExitBootServices() should be re-set, without
touching the memory services.)

There are two mitigating factors here:

- unlike with virtio-net, the scsi and block drivers in OVMF support
  only synchronous operations. When you are not calling their functions,
  there are no transfers in flight. And when something calls
  ExitBootServices(), that thing is not calling virtio-block /
  virtio-scsi functions.

- The first thing the OS-level virtio drivers do (certainly in Linux,
  hopefully in Windows) is a virtio-reset on each virtio device found.
  (This is actually required by the virtio specification, both old and
  new.)

Now, there's one small window for issues here. If something in the guest
scribbled over the memory that, according to QEMU, still hosts a live
virtio ring, between ExitBootServices() and
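The event lifecycle described in this message (register at bind, close at unbind, reset without touching memory services in the callback) can be modelled in plain C. This is deliberately not edk2 code: in edk2 the registration would go through gBS->CreateEvent(EVT_SIGNAL_EXIT_BOOT_SERVICES, ...) and the removal through gBS->CloseEvent(). All names below (`struct binding`, `driver_start()`, `driver_stop()`, `exit_boot_services()`, `model_free()`) are hypothetical stand-ins.

```c
#include <stdbool.h>
#include <stddef.h>

struct binding {
    bool alive;        /* device is configured and uses hw resources */
    bool hw_reset;     /* set once the device has been re-set */
};

typedef void (*exit_boot_fn)(struct binding *ctx);

#define MAX_EVENTS 4
static struct { exit_boot_fn fn; struct binding *ctx; } events[MAX_EVENTS];

static bool in_exit_boot;           /* true while notifications run */
static unsigned alloc_violations;   /* memory-service calls made in that window */

/* Model of a memory service: frees nothing, only records misuse. */
static void model_free(void *p)
{
    (void)p;
    if (in_exit_boot)
        alloc_violations++;
}

/* Shared handler: re-set a live device, call no memory services. */
static void on_exit_boot(struct binding *b)
{
    if (b->alive) {
        b->hw_reset = true;
        b->alive = false;
    }
}

/* Bind: the exit-boot event is part of the binding's resources. */
static int driver_start(struct binding *b)
{
    for (size_t i = 0; i < MAX_EVENTS; i++) {
        if (!events[i].fn) {
            events[i].fn = on_exit_boot;
            events[i].ctx = b;
            b->alive = true;
            b->hw_reset = false;
            return 0;
        }
    }
    return -1;
}

/* Unbind: close the event first, then release resources. */
static void driver_stop(struct binding *b)
{
    for (size_t i = 0; i < MAX_EVENTS; i++) {
        if (events[i].ctx == b) {
            events[i].fn = NULL;
            events[i].ctx = NULL;
        }
    }
    b->alive = false;
    model_free(b);   /* fine here: we are outside ExitBootServices() */
}

static void exit_boot_services(void)
{
    in_exit_boot = true;
    for (size_t i = 0; i < MAX_EVENTS; i++)
        if (events[i].fn)
            events[i].fn(events[i].ctx);
    in_exit_boot = false;
}
```

A handler that called driver_stop() from inside exit_boot_services() would bump `alloc_violations`, which is the model's version of the NotifyExitBoot() problem described above.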
Re: [Qemu-block] [Qemu-devel] [Xen-devel] Question about xen disk unplug support for ahci missed in qemu
On Fri, Oct 16, 2015 at 01:10:54AM +0200, Laszlo Ersek wrote: > On 10/14/15 13:27, Ian Campbell wrote: > > On Wed, 2015-10-14 at 12:06 +0100, Stefano Stabellini wrote: > >>> Can't you just teach SeaBIOS how to deal with your PV disks and then > >>> only add that to your VM and forget about IDE/AHCI? I mean, that's how > >>> it's done for virtio-blk, and it doesn't involve any insanities like > >>> ripping out non-hotpluggable devices. > >> > >> Teaching SeaBIOS to deal with PV disks can be done, in fact we already > >> support PV disks in OVMF. It is possible to boot Windows with OVMF > >> without any IDE disks (patch pending for libxl to create a VM without > >> emulated IDE disks). > > > > One stumbling block in the past has been how to know when the PV drivers in > > the BIOS are no longer required, such that the ring can be torn down and/or > > the connection etc handed over to the OS driver. [...] > > AFAIK the BIOS interfaces do not have anything as reliable as that. > > > > How does virtio deal with this in the BIOS case? > > It doesn't, as far as I can tell. > > I don't think it has to, though! On a BIOS box, you can always boot DOS, > or another operating system that continues to use the BIOS interfaces > forever. (Same as if you never call ExitBootServices() in UEFI.) > > Given that no starter pistol gets fired between the firmware and the OS > on such a platform, they must always respect each other. I guess this > could occur through the E820 map, or some such. One can use the "ACPI enable" SMI event to detect this if they really wanted to. In SeaBIOS one could do this from src/fw/smm.c:handle_smi() - however, no other drivers need this notification today and it would be a bit ugly to have to handle it from an SMI. (Assuming Xen were to support SMIs.) > No clue in what kind of E820 memory SeaBIOS allocates the virtio rings, > but I guess the Linux kernel stays away from those areas until it's past > device probing and binding. 
In SeaBIOS, the virtio memory is allocated from reserved memory. (See
the memalign_high() call in src/hw/virtio-pci.c - the "high" memory
zone is taken from reserved memory:
http://seabios.org/Memory_Model#Memory_available_during_initialization
)

What's the reason for the "stumbling block" that requires the BIOS to
tear down the Xen ring prior to the OS being able to replace it? The
BIOS disk calls are all synchronous, so the ring won't be active when
the OS brings up its own ring. Is there some low-level interaction
that prevents the OS from just resetting the ring prior to enabling
it?

-Kevin
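For reference, the "stay away from reserved memory" rule can be expressed as a check against the E820 map. The sketch below is not SeaBIOS or Linux code: `struct e820_entry` is a simplified, unpacked version of the real 20-byte entry, and `range_is_reserved()` is a hypothetical helper. The type codes 1 (usable RAM) and 2 (reserved) are the standard E820 values.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define E820_RAM      1u   /* usable by the OS */
#define E820_RESERVED 2u   /* the OS must leave it alone */

/* Simplified entry; the real BIOS structure is packed. */
struct e820_entry {
    uint64_t base;
    uint64_t len;
    uint32_t type;
};

/* True if [addr, addr + len) lies entirely inside one reserved entry.
 * Written with subtractions only, so it cannot overflow. */
static bool range_is_reserved(const struct e820_entry *map, size_t n,
                              uint64_t addr, uint64_t len)
{
    for (size_t i = 0; i < n; i++) {
        const struct e820_entry *e = &map[i];

        if (e->type != E820_RESERVED)
            continue;
        if (addr >= e->base && len <= e->len &&
            addr - e->base <= e->len - len)
            return true;
    }
    return false;
}
```

A firmware ring for which this check holds is in memory that an E820-respecting OS will not hand out to its allocator, which is the property relied on for the virtio rings.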