Re: [PATCH 4/4] vfio-pci/zdev: Introduce the zPCI I/O vfio region
On 1/27/21 12:45 PM, Cornelia Huck wrote:
> On Wed, 27 Jan 2021 08:53:05 -0700 Alex Williamson wrote:
> > On Wed, 27 Jan 2021 09:23:04 -0500 Matthew Rosato wrote:
> > > On 1/26/21 6:18 PM, Alex Williamson wrote:
> > > > On Mon, 25 Jan 2021 09:40:38 -0500 Matthew Rosato wrote:
> > > > > On 1/22/21 6:48 PM, Alex Williamson wrote:
> > > > > > On Tue, 19 Jan 2021 15:02:30 -0500 Matthew Rosato wrote:
> > > > > > > Some s390 PCI devices (e.g. ISM) perform I/O operations that have very specific requirements in terms of alignment as well as the patterns in which the data is read/written. Allowing these to proceed through the typical vfio_pci_bar_rw path will cause them to be broken up in such a way that these requirements can't be guaranteed. In addition, ISM devices do not support the MIO codepaths that might be triggered on vfio I/O coming from userspace; we must be able to ensure that these devices use the non-MIO instructions. To facilitate this, provide a new vfio region by which non-MIO instructions can be passed directly to the host kernel s390 PCI layer, to be reliably issued as non-MIO instructions.
> > > > > > >
> > > > > > > This patch introduces the new vfio VFIO_REGION_SUBTYPE_IBM_ZPCI_IO region and implements the ability to pass PCISTB and PCILG instructions over it, as these are what is required for ISM devices.
> > > > > >
> > > > > > There have been various discussions about splitting vfio-pci to allow more device specific drivers rather than adding duct tape and bailing wire for various device specific features to extend vfio-pci. The latest iteration is here[1]. Is it possible that such a solution could simply provide the standard BAR region indexes, but with an implementation that works on s390, rather than creating new device specific regions to perform the same task? Thanks,
> > > > > >
> > > > > > Alex
> > > > > >
> > > > > > [1] https://lore.kernel.org/lkml/20210117181534.65724-1-mgurto...@nvidia.com/
> > > > >
> > > > > Thanks for the pointer, I'll have to keep an eye on this. An approach like this could solve some issues, but I think a main issue that still remains with relying on the standard BAR region indexes (whether using the current vfio-pci driver or a device-specific driver) is that QEMU writes to said BAR memory region are happening in, at most, 8B chunks (which then, in the current general-purpose vfio-pci code, get further split up into 4B iowrite operations). The alternate approach I'm proposing here is allowing for the whole payload (4K) in a single operation, which is significantly faster. So, I suspect even with a device specific driver we'd want this sort of a region anyhow..
> > > >
> > > > Why is this device specific behavior? It would be a fair argument that acceptable device access widths for MMIO are always device specific, so we should never break them down. Looking at the PCI spec, a TLP requires a dword (4-byte) aligned address with a 10-bit length field indicating the number of dwords, so up to 4K data as you suggest is the
> > >
> > > Well, as I mentioned in a different thread, it's not really device
> >
> > Sorry, I tried to follow the thread, not sure it's possible w/o lots of preexisting s390 knowledge.
> >
> > > specific behavior but rather architecture/s390x specific behavior; PCISTB is an interface available to all PCI devices on s390x, it just so happens that ISM is the first device type that is running into the nuance. The architecture is designed in such a way that other devices can use the same interface in the same manner.
> >
> > As a platform access mechanism, this leans towards a platform specific implementation of the PCI BAR regions.
> >
> > > To drill down a bit, the PCISTB payload can either be 'strict' or 'relaxed', where per the s390x architecture 'relaxed' means it's not dword-aligned but rather byte-aligned up to 4K. We find out at init time which interface a device supports -- So, for a device that supports 'relaxed' PCISTB like ISM, an example would be a PCISTB of 11 bytes coming from a non-dword-aligned address is permissible, which doesn't match the TLP from the spec as you described... I believe this 'relaxed' operation that steps outside of the spec is unique to s390x. (Conversely, devices that use 'strict' PCISTB conform to the typical TLP definition)
> > >
> > > This deviation from spec is, to my mind, another argument to treat these particular PCISTB separately.
> >
> > I think that's just an accessor abstraction, we're not asking users to generate TLPs. If we expose a byte granularity interface, some platforms might pass that directly to the PCISTB command, otherwise might align the address, perform a dword access, and return the requested byte. AFAICT, both conventional and express PCI use dword alignment on the bus, so this should be valid and at best questions whether ISM is really PCI or not.
>
> The vibes I'm getting from ISM is that it is mostly a construct using (one set of) the s390 pci instructions, which ends up being something not entirely unlike a pci device... the question is how much things

Yep, that's a fair assessment.

> like the 'relaxed' payload may also be supported by 'real' pci devices plu
Re: [PATCH 4/4] vfio-pci/zdev: Introduce the zPCI I/O vfio region
On Wed, 27 Jan 2021 09:23:04 -0500 Matthew Rosato wrote:
> On 1/26/21 6:18 PM, Alex Williamson wrote:
> > On Mon, 25 Jan 2021 09:40:38 -0500 Matthew Rosato wrote:
> > > On 1/22/21 6:48 PM, Alex Williamson wrote:
> > > > On Tue, 19 Jan 2021 15:02:30 -0500 Matthew Rosato wrote:
> > > > > Some s390 PCI devices (e.g. ISM) perform I/O operations that have very specific requirements in terms of alignment as well as the patterns in which the data is read/written. Allowing these to proceed through the typical vfio_pci_bar_rw path will cause them to be broken up in such a way that these requirements can't be guaranteed. In addition, ISM devices do not support the MIO codepaths that might be triggered on vfio I/O coming from userspace; we must be able to ensure that these devices use the non-MIO instructions. To facilitate this, provide a new vfio region by which non-MIO instructions can be passed directly to the host kernel s390 PCI layer, to be reliably issued as non-MIO instructions.
> > > > >
> > > > > This patch introduces the new vfio VFIO_REGION_SUBTYPE_IBM_ZPCI_IO region and implements the ability to pass PCISTB and PCILG instructions over it, as these are what is required for ISM devices.
> > > >
> > > > There have been various discussions about splitting vfio-pci to allow more device specific drivers rather than adding duct tape and bailing wire for various device specific features to extend vfio-pci. The latest iteration is here[1]. Is it possible that such a solution could simply provide the standard BAR region indexes, but with an implementation that works on s390, rather than creating new device specific regions to perform the same task? Thanks,
> > > >
> > > > Alex
> > > >
> > > > [1] https://lore.kernel.org/lkml/20210117181534.65724-1-mgurto...@nvidia.com/
> > >
> > > Thanks for the pointer, I'll have to keep an eye on this. An approach like this could solve some issues, but I think a main issue that still remains with relying on the standard BAR region indexes (whether using the current vfio-pci driver or a device-specific driver) is that QEMU writes to said BAR memory region are happening in, at most, 8B chunks (which then, in the current general-purpose vfio-pci code, get further split up into 4B iowrite operations). The alternate approach I'm proposing here is allowing for the whole payload (4K) in a single operation, which is significantly faster. So, I suspect even with a device specific driver we'd want this sort of a region anyhow..
> >
> > Why is this device specific behavior? It would be a fair argument that acceptable device access widths for MMIO are always device specific, so we should never break them down. Looking at the PCI spec, a TLP requires a dword (4-byte) aligned address with a 10-bit length field indicating the number of dwords, so up to 4K data as you suggest is the
>
> Well, as I mentioned in a different thread, it's not really device

Sorry, I tried to follow the thread, not sure it's possible w/o lots of preexisting s390 knowledge.

> specific behavior but rather architecture/s390x specific behavior; PCISTB is an interface available to all PCI devices on s390x, it just so happens that ISM is the first device type that is running into the nuance. The architecture is designed in such a way that other devices can use the same interface in the same manner.

As a platform access mechanism, this leans towards a platform specific implementation of the PCI BAR regions.

> To drill down a bit, the PCISTB payload can either be 'strict' or 'relaxed', where per the s390x architecture 'relaxed' means it's not dword-aligned but rather byte-aligned up to 4K. We find out at init time which interface a device supports -- So, for a device that supports 'relaxed' PCISTB like ISM, an example would be a PCISTB of 11 bytes coming from a non-dword-aligned address is permissible, which doesn't match the TLP from the spec as you described... I believe this 'relaxed' operation that steps outside of the spec is unique to s390x. (Conversely, devices that use 'strict' PCISTB conform to the typical TLP definition)
>
> This deviation from spec is, to my mind, another argument to treat these particular PCISTB separately.

I think that's just an accessor abstraction, we're not asking users to generate TLPs. If we expose a byte granularity interface, some platforms might pass that directly to the PCISTB command, otherwise might align the address, perform a dword access, and return the requested byte. AFAICT, both conventional and express PCI use dword alignment on the bus, so this should be valid and at best questions whether ISM is really PCI or not.

> > whole payload. It's quite possible that the reason we don't have more
> > access wid
Re: [PATCH 4/4] vfio-pci/zdev: Introduce the zPCI I/O vfio region
On 1/26/21 6:18 PM, Alex Williamson wrote:
> On Mon, 25 Jan 2021 09:40:38 -0500 Matthew Rosato wrote:
> > On 1/22/21 6:48 PM, Alex Williamson wrote:
> > > On Tue, 19 Jan 2021 15:02:30 -0500 Matthew Rosato wrote:
> > > > Some s390 PCI devices (e.g. ISM) perform I/O operations that have very specific requirements in terms of alignment as well as the patterns in which the data is read/written. Allowing these to proceed through the typical vfio_pci_bar_rw path will cause them to be broken up in such a way that these requirements can't be guaranteed. In addition, ISM devices do not support the MIO codepaths that might be triggered on vfio I/O coming from userspace; we must be able to ensure that these devices use the non-MIO instructions. To facilitate this, provide a new vfio region by which non-MIO instructions can be passed directly to the host kernel s390 PCI layer, to be reliably issued as non-MIO instructions.
> > > >
> > > > This patch introduces the new vfio VFIO_REGION_SUBTYPE_IBM_ZPCI_IO region and implements the ability to pass PCISTB and PCILG instructions over it, as these are what is required for ISM devices.
> > >
> > > There have been various discussions about splitting vfio-pci to allow more device specific drivers rather than adding duct tape and bailing wire for various device specific features to extend vfio-pci. The latest iteration is here[1]. Is it possible that such a solution could simply provide the standard BAR region indexes, but with an implementation that works on s390, rather than creating new device specific regions to perform the same task? Thanks,
> > >
> > > Alex
> > >
> > > [1] https://lore.kernel.org/lkml/20210117181534.65724-1-mgurto...@nvidia.com/
> >
> > Thanks for the pointer, I'll have to keep an eye on this. An approach like this could solve some issues, but I think a main issue that still remains with relying on the standard BAR region indexes (whether using the current vfio-pci driver or a device-specific driver) is that QEMU writes to said BAR memory region are happening in, at most, 8B chunks (which then, in the current general-purpose vfio-pci code, get further split up into 4B iowrite operations). The alternate approach I'm proposing here is allowing for the whole payload (4K) in a single operation, which is significantly faster. So, I suspect even with a device specific driver we'd want this sort of a region anyhow..
>
> Why is this device specific behavior? It would be a fair argument that acceptable device access widths for MMIO are always device specific, so we should never break them down. Looking at the PCI spec, a TLP requires a dword (4-byte) aligned address with a 10-bit length field indicating the number of dwords, so up to 4K data as you suggest is the

Well, as I mentioned in a different thread, it's not really device specific behavior but rather architecture/s390x specific behavior; PCISTB is an interface available to all PCI devices on s390x, it just so happens that ISM is the first device type that is running into the nuance. The architecture is designed in such a way that other devices can use the same interface in the same manner.

To drill down a bit, the PCISTB payload can either be 'strict' or 'relaxed', where per the s390x architecture 'relaxed' means it's not dword-aligned but rather byte-aligned up to 4K. We find out at init time which interface a device supports -- So, for a device that supports 'relaxed' PCISTB like ISM, an example would be a PCISTB of 11 bytes coming from a non-dword-aligned address is permissible, which doesn't match the TLP from the spec as you described... I believe this 'relaxed' operation that steps outside of the spec is unique to s390x. (Conversely, devices that use 'strict' PCISTB conform to the typical TLP definition)

This deviation from spec is, to my mind, another argument to treat these particular PCISTB separately.

> whole payload. It's quite possible that the reason we don't have more access width problems is that MMIO is typically mmap'd on other platforms. We get away with using the x-no-mmap=on flag for debugging, but it's not unheard of that the device also doesn't work quite correctly with that flag, which could be due to access width or timing difference. So really, I don't see why we wouldn't want to maintain the guest access width through QEMU and the kernel interface for all devices. It seems like that should be our default vfio-pci implementation. I think we chose the current width based on the QEMU implementation that was already splitting accesses, and it (mostly) worked. Thanks,

But unless you think that allowing more flexibility than the PCI spec dictates (byte-aligned up to 4K rather than dword-aligned up to 4K, see above) this still wouldn't allow me to solve the issue I'm trying to solve with this patch set. If you DO think allowing byte-aligned access up to 4K is OK, then I'm still left with a further issue (@Niklas): I'm also using this region-based approach to ensure that the host uses a matching inter
Re: [PATCH 4/4] vfio-pci/zdev: Introduce the zPCI I/O vfio region
On Mon, 25 Jan 2021 09:40:38 -0500 Matthew Rosato wrote:
> On 1/22/21 6:48 PM, Alex Williamson wrote:
> > On Tue, 19 Jan 2021 15:02:30 -0500 Matthew Rosato wrote:
> > > Some s390 PCI devices (e.g. ISM) perform I/O operations that have very specific requirements in terms of alignment as well as the patterns in which the data is read/written. Allowing these to proceed through the typical vfio_pci_bar_rw path will cause them to be broken up in such a way that these requirements can't be guaranteed. In addition, ISM devices do not support the MIO codepaths that might be triggered on vfio I/O coming from userspace; we must be able to ensure that these devices use the non-MIO instructions. To facilitate this, provide a new vfio region by which non-MIO instructions can be passed directly to the host kernel s390 PCI layer, to be reliably issued as non-MIO instructions.
> > >
> > > This patch introduces the new vfio VFIO_REGION_SUBTYPE_IBM_ZPCI_IO region and implements the ability to pass PCISTB and PCILG instructions over it, as these are what is required for ISM devices.
> >
> > There have been various discussions about splitting vfio-pci to allow more device specific drivers rather than adding duct tape and bailing wire for various device specific features to extend vfio-pci. The latest iteration is here[1]. Is it possible that such a solution could simply provide the standard BAR region indexes, but with an implementation that works on s390, rather than creating new device specific regions to perform the same task? Thanks,
> >
> > Alex
> >
> > [1] https://lore.kernel.org/lkml/20210117181534.65724-1-mgurto...@nvidia.com/
>
> Thanks for the pointer, I'll have to keep an eye on this. An approach like this could solve some issues, but I think a main issue that still remains with relying on the standard BAR region indexes (whether using the current vfio-pci driver or a device-specific driver) is that QEMU writes to said BAR memory region are happening in, at most, 8B chunks (which then, in the current general-purpose vfio-pci code, get further split up into 4B iowrite operations). The alternate approach I'm proposing here is allowing for the whole payload (4K) in a single operation, which is significantly faster. So, I suspect even with a device specific driver we'd want this sort of a region anyhow..

Why is this device specific behavior? It would be a fair argument that acceptable device access widths for MMIO are always device specific, so we should never break them down. Looking at the PCI spec, a TLP requires a dword (4-byte) aligned address with a 10-bit length field indicating the number of dwords, so up to 4K data as you suggest is the whole payload. It's quite possible that the reason we don't have more access width problems is that MMIO is typically mmap'd on other platforms. We get away with using the x-no-mmap=on flag for debugging, but it's not unheard of that the device also doesn't work quite correctly with that flag, which could be due to access width or timing difference. So really, I don't see why we wouldn't want to maintain the guest access width through QEMU and the kernel interface for all devices. It seems like that should be our default vfio-pci implementation. I think we chose the current width based on the QEMU implementation that was already splitting accesses, and it (mostly) worked. Thanks,

Alex
Re: [PATCH 4/4] vfio-pci/zdev: Introduce the zPCI I/O vfio region
On Mon, 25 Jan 2021 09:40:38 -0500 Matthew Rosato wrote:
> On 1/22/21 6:48 PM, Alex Williamson wrote:
> > On Tue, 19 Jan 2021 15:02:30 -0500 Matthew Rosato wrote:
> > > Some s390 PCI devices (e.g. ISM) perform I/O operations that have very specific requirements in terms of alignment as well as the patterns in which the data is read/written. Allowing these to proceed through the typical vfio_pci_bar_rw path will cause them to be broken up in such a way that these requirements can't be guaranteed. In addition, ISM devices do not support the MIO codepaths that might be triggered on vfio I/O coming from userspace; we must be able to ensure that these devices use the non-MIO instructions. To facilitate this, provide a new vfio region by which non-MIO instructions can be passed directly to the host kernel s390 PCI layer, to be reliably issued as non-MIO instructions.
> > >
> > > This patch introduces the new vfio VFIO_REGION_SUBTYPE_IBM_ZPCI_IO region and implements the ability to pass PCISTB and PCILG instructions over it, as these are what is required for ISM devices.
> >
> > There have been various discussions about splitting vfio-pci to allow more device specific drivers rather than adding duct tape and bailing wire for various device specific features to extend vfio-pci. The latest iteration is here[1]. Is it possible that such a solution could simply provide the standard BAR region indexes, but with an implementation that works on s390, rather than creating new device specific regions to perform the same task? Thanks,
> >
> > Alex
> >
> > [1] https://lore.kernel.org/lkml/20210117181534.65724-1-mgurto...@nvidia.com/
>
> Thanks for the pointer, I'll have to keep an eye on this. An approach like this could solve some issues, but I think a main issue that still remains with relying on the standard BAR region indexes (whether using the current vfio-pci driver or a device-specific driver) is that QEMU writes to said BAR memory region are happening in, at most, 8B chunks (which then, in the current general-purpose vfio-pci code, get further split up into 4B iowrite operations). The alternate approach I'm proposing here is allowing for the whole payload (4K) in a single operation, which is significantly faster. So, I suspect even with a device specific driver we'd want this sort of a region anyhow..

I'm also wondering about device specific vs architecture/platform specific handling. If we're trying to support ISM devices, that's device specific handling; but if we're trying to add more generic things like the large payload support, that's not necessarily tied to a device, is it? For example, could a device support large payload if plugged into a z, but not if plugged into another machine?
Re: [PATCH 4/4] vfio-pci/zdev: Introduce the zPCI I/O vfio region
On 1/22/21 6:48 PM, Alex Williamson wrote:
> On Tue, 19 Jan 2021 15:02:30 -0500 Matthew Rosato wrote:
> > Some s390 PCI devices (e.g. ISM) perform I/O operations that have very specific requirements in terms of alignment as well as the patterns in which the data is read/written. Allowing these to proceed through the typical vfio_pci_bar_rw path will cause them to be broken up in such a way that these requirements can't be guaranteed. In addition, ISM devices do not support the MIO codepaths that might be triggered on vfio I/O coming from userspace; we must be able to ensure that these devices use the non-MIO instructions. To facilitate this, provide a new vfio region by which non-MIO instructions can be passed directly to the host kernel s390 PCI layer, to be reliably issued as non-MIO instructions.
> >
> > This patch introduces the new vfio VFIO_REGION_SUBTYPE_IBM_ZPCI_IO region and implements the ability to pass PCISTB and PCILG instructions over it, as these are what is required for ISM devices.
>
> There have been various discussions about splitting vfio-pci to allow more device specific drivers rather than adding duct tape and bailing wire for various device specific features to extend vfio-pci. The latest iteration is here[1]. Is it possible that such a solution could simply provide the standard BAR region indexes, but with an implementation that works on s390, rather than creating new device specific regions to perform the same task? Thanks,
>
> Alex
>
> [1] https://lore.kernel.org/lkml/20210117181534.65724-1-mgurto...@nvidia.com/

Thanks for the pointer, I'll have to keep an eye on this. An approach like this could solve some issues, but I think a main issue that still remains with relying on the standard BAR region indexes (whether using the current vfio-pci driver or a device-specific driver) is that QEMU writes to said BAR memory region are happening in, at most, 8B chunks (which then, in the current general-purpose vfio-pci code, get further split up into 4B iowrite operations).
The alternate approach I'm proposing here is allowing for the whole payload (4K) in a single operation, which is significantly faster. So, I suspect even with a device specific driver we'd want this sort of a region anyhow..
Re: [PATCH 4/4] vfio-pci/zdev: Introduce the zPCI I/O vfio region
On 1/25/21 10:42 AM, Cornelia Huck wrote:
> On Mon, 25 Jan 2021 09:40:38 -0500 Matthew Rosato wrote:
> > On 1/22/21 6:48 PM, Alex Williamson wrote:
> > > On Tue, 19 Jan 2021 15:02:30 -0500 Matthew Rosato wrote:
> > > > Some s390 PCI devices (e.g. ISM) perform I/O operations that have very specific requirements in terms of alignment as well as the patterns in which the data is read/written. Allowing these to proceed through the typical vfio_pci_bar_rw path will cause them to be broken up in such a way that these requirements can't be guaranteed. In addition, ISM devices do not support the MIO codepaths that might be triggered on vfio I/O coming from userspace; we must be able to ensure that these devices use the non-MIO instructions. To facilitate this, provide a new vfio region by which non-MIO instructions can be passed directly to the host kernel s390 PCI layer, to be reliably issued as non-MIO instructions.
> > > >
> > > > This patch introduces the new vfio VFIO_REGION_SUBTYPE_IBM_ZPCI_IO region and implements the ability to pass PCISTB and PCILG instructions over it, as these are what is required for ISM devices.
> > >
> > > There have been various discussions about splitting vfio-pci to allow more device specific drivers rather than adding duct tape and bailing wire for various device specific features to extend vfio-pci. The latest iteration is here[1]. Is it possible that such a solution could simply provide the standard BAR region indexes, but with an implementation that works on s390, rather than creating new device specific regions to perform the same task? Thanks,
> > >
> > > Alex
> > >
> > > [1] https://lore.kernel.org/lkml/20210117181534.65724-1-mgurto...@nvidia.com/
> >
> > Thanks for the pointer, I'll have to keep an eye on this. An approach like this could solve some issues, but I think a main issue that still remains with relying on the standard BAR region indexes (whether using the current vfio-pci driver or a device-specific driver) is that QEMU writes to said BAR memory region are happening in, at most, 8B chunks (which then, in the current general-purpose vfio-pci code, get further split up into 4B iowrite operations). The alternate approach I'm proposing here is allowing for the whole payload (4K) in a single operation, which is significantly faster. So, I suspect even with a device specific driver we'd want this sort of a region anyhow..
>
> I'm also wondering about device specific vs architecture/platform specific handling. If we're trying to support ISM devices, that's device specific handling; but if we're trying to add more generic things like the large payload support, that's not necessarily tied to a device, is it? For example, could a device support large payload if plugged into a z, but not if plugged into another machine?

Yes, that's correct -- While ISM is providing the impetus and has a hard requirement for some of this due to the MIO instruction quirk, the mechanism being implemented here is definitely not ISM-specific -- it's more like an s390-wide quirk that could really benefit any device that wants to do large payloads (PCISTB). And I think that ultimately goes back to why Pierre wanted to have QEMU be as permissive as possible in using the region vs limiting it only to ISM.
Re: [PATCH 4/4] vfio-pci/zdev: Introduce the zPCI I/O vfio region
On Tue, 19 Jan 2021 15:02:30 -0500 Matthew Rosato wrote: > Some s390 PCI devices (e.g. ISM) perform I/O operations that have very > specific requirements in terms of alignment as well as the patterns in > which the data is read/written. Allowing these to proceed through the > typical vfio_pci_bar_rw path will cause them to be broken in up in such a > way that these requirements can't be guaranteed. In addition, ISM devices > do not support the MIO codepaths that might be triggered on vfio I/O coming > from userspace; we must be able to ensure that these devices use the > non-MIO instructions. To facilitate this, provide a new vfio region by > which non-MIO instructions can be passed directly to the host kernel s390 > PCI layer, to be reliably issued as non-MIO instructions. > > This patch introduces the new vfio VFIO_REGION_SUBTYPE_IBM_ZPCI_IO region > and implements the ability to pass PCISTB and PCILG instructions over it, > as these are what is required for ISM devices. There have been various discussions about splitting vfio-pci to allow more device specific drivers rather adding duct tape and bailing wire for various device specific features to extend vfio-pci. The latest iteration is here[1]. Is it possible that such a solution could simply provide the standard BAR region indexes, but with an implementation that works on s390, rather than creating new device specific regions to perform the same task? 
Thanks,
Alex

[1] https://lore.kernel.org/lkml/20210117181534.65724-1-mgurto...@nvidia.com/
Re: [PATCH 4/4] vfio-pci/zdev: Introduce the zPCI I/O vfio region
On Wed, 20 Jan 2021 14:21:59 +0100 Niklas Schnelle wrote:
> On 1/19/21 9:02 PM, Matthew Rosato wrote:
> > Some s390 PCI devices (e.g. ISM) perform I/O operations that have very
> .. snip ...
> > +
> > +static size_t vfio_pci_zdev_io_rw(struct vfio_pci_device *vdev,
> > +				  char __user *buf, size_t count,
> > +				  loff_t *ppos, bool iswrite)
> > +{
> ... snip ...
> > +	/*
> > +	 * For now, the largest allowed block I/O is advertised as PAGE_SIZE,
> > +	 * and cannot exceed a page boundary - so a single page is enough. The
> > +	 * guest should have validated this but let's double-check that the
> > +	 * request will not cross a page boundary.
> > +	 */
> > +	if (((region->req.gaddr & ~PAGE_MASK)
> > +	    + region->req.len - 1) & PAGE_MASK) {
> > +		return -EIO;
> > +	}
> > +
> > +	mutex_lock(&zdev->lock);
>
> I plan on using the zdev->lock for preventing concurrent zPCI device
> removal/configuration state changes between zPCI availability/error
> events and enable_slot()/disable_slot() and
> /sys/bus/pci/devices//recover.
>
> With that use in place, using it here causes a deadlock when doing
> "echo 0 > /sys/bus/pci/slots//power" from the host for an ISM device
> attached to a guest.
>
> This is because the (soft) hot unplug will cause vfio to notify QEMU, which
> sends a deconfiguration request to the guest, which then tries to
> gracefully shut down the device. During that shutdown the device will
> be accessed, running into this code path, which then blocks on
> the lock held by the disable_slot() code, which waits on vfio
> releasing the device.
>
> Alex may correct me if I'm wrong but I think instead vfio should
> be holding the PCI device lock via pci_device_lock(pdev).

There be dragons in device_lock, which is why we have all the crude try-lock semantics in reset paths. Don't use it trivially. The device lock is not otherwise necessary here: the device is bound to a driver and has an open userspace file descriptor through which the user is performing this access. The device can't be removed without unbinding the driver, which can't happen while the user still has open files to it.

Thanks,
Alex
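The page-crossing check quoted in this exchange can be exercised on its own. A minimal sketch, assuming a 4K PAGE_SIZE (as on s390x); `req_crosses_page()` is a stand-in name for illustration, not a function in the patch:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SIZE 4096UL
#define PAGE_MASK (~(PAGE_SIZE - 1))

/*
 * Mirrors the check in vfio_pci_zdev_io_rw(): the request crosses a page
 * boundary iff the last byte lands on a different page than the first,
 * i.e. offset-within-page + len - 1 spills past PAGE_SIZE - 1.
 */
static bool req_crosses_page(uint64_t gaddr, uint64_t len)
{
	return (((gaddr & ~PAGE_MASK) + len - 1) & PAGE_MASK) != 0;
}
```

For example, a 512-byte request starting at offset 0xF00 within a page crosses into the next page and is rejected with -EIO, while a full 4K request that starts page-aligned passes.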
Re: [PATCH 4/4] vfio-pci/zdev: Introduce the zPCI I/O vfio region
On 1/21/21 5:01 AM, Niklas Schnelle wrote:
> On 1/19/21 9:02 PM, Matthew Rosato wrote:
> > Some s390 PCI devices (e.g. ISM) perform I/O operations that have very
> > specific requirements in terms of alignment as well as the patterns in
> > which the data is read/written. Allowing these to proceed through the
> > typical vfio_pci_bar_rw path will cause them to be broken up in such a
> > way that these requirements can't be guaranteed. In addition, ISM devices
> > do not support the MIO codepaths that might be triggered on vfio I/O coming
> > from userspace; we must be able to ensure that these devices use the
> > non-MIO instructions. To facilitate this, provide a new vfio region by
> > which non-MIO instructions can be passed directly to the host kernel s390
> > PCI layer, to be reliably issued as non-MIO instructions.
> >
> > This patch introduces the new vfio VFIO_REGION_SUBTYPE_IBM_ZPCI_IO region
> > and implements the ability to pass PCISTB and PCILG instructions over it,
> > as these are what is required for ISM devices.
> >
> > Signed-off-by: Matthew Rosato
> > ---
> >  drivers/vfio/pci/vfio_pci.c         |   8 ++
> >  drivers/vfio/pci/vfio_pci_private.h |   6 ++
> >  drivers/vfio/pci/vfio_pci_zdev.c    | 158
> >  include/uapi/linux/vfio.h           |   4 +
> >  include/uapi/linux/vfio_zdev.h      |  33
> >  5 files changed, 209 insertions(+)
>
> Related to the discussion on the QEMU side: if we have a check to make
> sure this is only used for ISM, then this patch should make that clear in
> its wording and also in the paths (drivers/vfio/pci/vfio_pci_ism.c instead
> of vfio_pci_zdev.c). This also has precedent with the region for IGD in
> drivers/vfio/pci/vfio_pci_igd.c.

This is a fair point, but just to tie up threads here -- the QEMU discussion has since moved towards making the use-case less restrictive rather than ISM-only (though getting ISM working is still the motivating factor here). So as such I think I'd prefer to keep this in vfio_pci_zdev.c, as other hardware could use the region if it meets the necessary criteria.
Re: [PATCH 4/4] vfio-pci/zdev: Introduce the zPCI I/O vfio region
On 1/20/21 12:28 PM, Niklas Schnelle wrote: On 1/20/21 6:10 PM, Matthew Rosato wrote: On 1/20/21 8:21 AM, Niklas Schnelle wrote: On 1/19/21 9:02 PM, Matthew Rosato wrote: Some s390 PCI devices (e.g. ISM) perform I/O operations that have very .. snip ... + +static size_t vfio_pci_zdev_io_rw(struct vfio_pci_device *vdev, + char __user *buf, size_t count, + loff_t *ppos, bool iswrite) +{ ... snip ... + /* + * For now, the largest allowed block I/O is advertised as PAGE_SIZE, + * and cannot exceed a page boundary - so a single page is enough. The + * guest should have validated this but let's double-check that the + * request will not cross a page boundary. + */ + if (((region->req.gaddr & ~PAGE_MASK) + + region->req.len - 1) & PAGE_MASK) { + return -EIO; + } + + mutex_lock(&zdev->lock); I plan on using the zdev->lock for preventing concurrent zPCI devices removal/configuration state changes between zPCI availability/error events and enable_slot()/disable_slot() and /sys/bus/pci/devices//recover. With that use in place using it here causes a deadlock when doing "echo 0 > /sys/bus/pci/slots//power from the host for an ISM device attached to a guest. This is because the (soft) hot unplug will cause vfio to notify QEMU, which sends a deconfiguration request to the guest, which then tries to gracefully shutdown the device. During that shutdown the device will be accessed, running into this code path which then blocks on the lock held by the disable_slot() code which waits on vfio releasing the device. Oof, good catch. The primary purpose of acquiring the zdev lock here was to ensure that the region is only being used to process one operation at a time and at the time I wrote this initially the zdev lock seemed like a reasonable candidate :) Oh ok, I thought it was to guard against removal. What's the reason we can't have multiple concurrent users of the region though? The PCISTB/PCILG should be able to be executed concurrently but I'm probably missing something. 
Yes, we are protecting the vfio region data, not PCISTB/PCILG execution. When we enter vfio_pci_zdev_io_rw() with an I/O operation (a 'write' operation from the guest), we read in the data from the guest and place it into vdev->region[].data. The data being read in is effectively instructing us which I/O operation to perform. Now, if you have two threads performing a write operation to this function at the same time (and, furthermore, at the same or overlapping positions), they would overwrite or collide with each other, resulting in unpredictable contents in vdev->region[].data and thus likely the wrong I/O operations being performed. It so happens that the QEMU implementation will ensure that won't happen, but the kernel interface also needs to make sure it doesn't happen.

> > > Alex may correct me if I'm wrong but I think instead vfio should
> > > be holding the PCI device lock via pci_device_lock(pdev).
> >
> > OK, I can have a look at this. Thanks.
>
> Hmm, ok, that idea was based on my assumption that it guards against
> removal. My bad; of course without my patch it wouldn't, and this came
> before...

That would be a nice added bonus (we discussed it offline late last year so that's probably why you're thinking this).
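The kind of collision being described can be sketched without any hardware. The struct layout, field names, and values below are invented purely to illustrate two interleaved writers tearing a shared request buffer; the real region layout lives in the UAPI header:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for the request portion of vdev->region[i].data. */
struct fake_req {
	uint64_t gaddr;
	uint64_t len;
};

/*
 * Writer A and writer B each copy their request field by field into the
 * shared buffer, interleaved the way two racing pwrite()s could be.
 * The result pairs B's gaddr with A's len: a request neither user made.
 */
static bool torn_request(void)
{
	struct fake_req a = { 0x1000, 4096 };
	struct fake_req b = { 0x8000, 8 };
	struct fake_req shared;

	memcpy(&shared.gaddr, &a.gaddr, sizeof(shared.gaddr)); /* A writes gaddr */
	memcpy(&shared.gaddr, &b.gaddr, sizeof(shared.gaddr)); /* B overwrites it */
	memcpy(&shared.len, &a.len, sizeof(shared.len));       /* A writes len   */
	/* ...and the request is acted on before B's len ever lands. */

	return memcmp(&shared, &a, sizeof(shared)) != 0 &&
	       memcmp(&shared, &b, sizeof(shared)) != 0;
}
```

This is why the kernel side serializes region access itself rather than trusting userspace to do so; which lock it should serialize on is exactly what this exchange is sorting out.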
The annotated trace with my test code looks as follows:

[ 618.025091] Call Trace:
[ 618.025093] [<0007c1a139e0>] __schedule+0x360/0x960
[ 618.025104] [<0007c1a14760>] schedule_preempt_disabled+0x60/0x100
[ 618.025107] [<0007c1a16b48>] __mutex_lock+0x358/0x880
[ 618.025110] [<0007c1a170a2>] mutex_lock_nested+0x32/0x40
[ 618.025112] [<03ff805a3948>] vfio_pci_zdev_io_rw+0x168/0x310 [vfio_pci]
[ 618.025120] [<03ff8059b2b0>] vfio_pci_write+0xd0/0xe0 [vfio_pci]
[ 618.025124] [<0007c0fa5392>] __s390x_sys_pwrite64+0x112/0x360
[ 618.025129] [<0007c1a0aaf6>] __do_syscall+0x116/0x190
[ 618.025132] [<0007c1a1deda>] system_call+0x72/0x98
[ 618.025137] 1 lock held by qemu-system-s39/1315:
[ 618.025139] #0: 8524b4e8 (&zdev->lock){}-{3:3}, at: vfio_pci_zdev_io_rw+0x168/0x310 [vfio_pci]
[ 618.025151] Showing all locks held in the system:
[ 618.025166] 1 lock held by khungtaskd/99:
[ 618.025168] #0: 0007c1ed4748 (rcu_read_lock){}-{1:2}, at: rcu_lock_acquire.constprop.0+0x0/0x210
[ 618.025194] 6 locks held by zsh/1190:
[ 618.025196] #0: 95fc0488 (sb_writers#3){}-{0:0}, at: __do_syscall+0x116/0x190
[ 618.025226] #1: 975bf090 (&of->mutex){}-{3:3}, at: kernfs_fop_write+0x9a/0x240
[ 618.025236] #2: 8584be78 (kn->active#245){}-{0:0}, at: kernfs_fop_write+0xa6/0x240
[ 618.025243] #3: 8524b4e8 (&zdev->lock){}-{3:3}, at: disable_slot+0x32/0x130 <-|
[ 618.025252] #4: 0007c1f53468 (pci_rescan_remove_lock){}-{3:3}, at: pci_stop_and_remove
[PATCH 4/4] vfio-pci/zdev: Introduce the zPCI I/O vfio region
Some s390 PCI devices (e.g. ISM) perform I/O operations that have very
specific requirements in terms of alignment as well as the patterns in
which the data is read/written. Allowing these to proceed through the
typical vfio_pci_bar_rw path will cause them to be broken up in such a
way that these requirements can't be guaranteed. In addition, ISM devices
do not support the MIO codepaths that might be triggered on vfio I/O coming
from userspace; we must be able to ensure that these devices use the
non-MIO instructions. To facilitate this, provide a new vfio region by
which non-MIO instructions can be passed directly to the host kernel s390
PCI layer, to be reliably issued as non-MIO instructions.

This patch introduces the new vfio VFIO_REGION_SUBTYPE_IBM_ZPCI_IO region
and implements the ability to pass PCISTB and PCILG instructions over it,
as these are what is required for ISM devices.

Signed-off-by: Matthew Rosato
---
 drivers/vfio/pci/vfio_pci.c         |   8 ++
 drivers/vfio/pci/vfio_pci_private.h |   6 ++
 drivers/vfio/pci/vfio_pci_zdev.c    | 158
 include/uapi/linux/vfio.h           |   4 +
 include/uapi/linux/vfio_zdev.h      |  33
 5 files changed, 209 insertions(+)

diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index 706de3e..e1c156e 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -407,6 +407,14 @@ static int vfio_pci_enable(struct vfio_pci_device *vdev)
 		}
 	}
 
+	if (IS_ENABLED(CONFIG_VFIO_PCI_ZDEV)) {
+		ret = vfio_pci_zdev_io_init(vdev);
+		if (ret && ret != -ENODEV) {
+			pci_warn(pdev, "Failed to setup zPCI I/O region\n");
+			return ret;
+		}
+	}
+
 	vfio_pci_probe_mmaps(vdev);
 
 	return 0;
diff --git a/drivers/vfio/pci/vfio_pci_private.h b/drivers/vfio/pci/vfio_pci_private.h
index 5c90e56..bc49980 100644
--- a/drivers/vfio/pci/vfio_pci_private.h
+++ b/drivers/vfio/pci/vfio_pci_private.h
@@ -217,12 +217,18 @@ static inline int vfio_pci_ibm_npu2_init(struct vfio_pci_device *vdev)
 #ifdef CONFIG_VFIO_PCI_ZDEV
 extern int vfio_pci_info_zdev_add_caps(struct vfio_pci_device *vdev,
 				       struct vfio_info_cap *caps);
+extern int vfio_pci_zdev_io_init(struct vfio_pci_device *vdev);
 #else
 static inline int vfio_pci_info_zdev_add_caps(struct vfio_pci_device *vdev,
 					      struct vfio_info_cap *caps)
 {
 	return -ENODEV;
 }
+
+static inline int vfio_pci_zdev_io_init(struct vfio_pci_device *vdev)
+{
+	return -ENODEV;
+}
 #endif
 
 #endif /* VFIO_PCI_PRIVATE_H */
diff --git a/drivers/vfio/pci/vfio_pci_zdev.c b/drivers/vfio/pci/vfio_pci_zdev.c
index 57e19ff..a962043 100644
--- a/drivers/vfio/pci/vfio_pci_zdev.c
+++ b/drivers/vfio/pci/vfio_pci_zdev.c
@@ -18,6 +18,7 @@
 #include
 #include
 #include
+#include
 
 #include "vfio_pci_private.h"
 
@@ -143,3 +144,160 @@ int vfio_pci_info_zdev_add_caps(struct vfio_pci_device *vdev,
 
 	return ret;
 }
+
+static size_t vfio_pci_zdev_io_rw(struct vfio_pci_device *vdev,
+				  char __user *buf, size_t count,
+				  loff_t *ppos, bool iswrite)
+{
+	unsigned int i = VFIO_PCI_OFFSET_TO_INDEX(*ppos) - VFIO_PCI_NUM_REGIONS;
+	struct vfio_region_zpci_io *region = vdev->region[i].data;
+	struct zpci_dev *zdev = to_zpci(vdev->pdev);
+	loff_t pos = *ppos & VFIO_PCI_OFFSET_MASK;
+	void *base = region;
+	struct page *gpage;
+	void *gaddr;
+	u64 *data;
+	int ret;
+	u64 req;
+
+	if ((!vdev->pdev->bus) || (!zdev))
+		return -ENODEV;
+
+	if (pos >= vdev->region[i].size)
+		return -EINVAL;
+
+	count = min(count, (size_t)(vdev->region[i].size - pos));
+
+	if (!iswrite) {
+		/* Only allow reads to the _hdr area */
+		if (pos + count > offsetof(struct vfio_region_zpci_io, req))
+			return -EFAULT;
+		if (copy_to_user(buf, base + pos, count))
+			return -EFAULT;
+		return count;
+	}
+
+	/* Only allow writes to the _req area */
+	if (pos < offsetof(struct vfio_region_zpci_io, req))
+		return -EFAULT;
+	if (copy_from_user(base + pos, buf, count))
+		return -EFAULT;
+
+	/*
+	 * Read operations are limited to 8B
+	 */
+	if ((region->req.flags & VFIO_ZPCI_IO_FLAG_READ) &&
+	    (region->req.len > 8)) {
+		return -EIO;
+	}
+
+	/*
+	 * Block write operations are limited to hardware-reported max
+	 */
+	if ((region->req.flags & VFIO_ZPCI_IO_FLAG_BLOCKW) &&
+	    (region->req.len > zdev->maxstbl)) {
+		return -EIO;
+	}
+
+	/*
+	 * While some devices may a