Re: [Qemu-block] [Qemu-devel] [PATCH 1/5] nvme: PCI/e configuration from specification
On 9/12/18 2:53 PM, Gersner wrote:
> Hi Daniel,
>
> Sorry for the long round-trips, we had a busy month. We have implemented all
> the changes and are waiting for a final clarification: should the new patches
> be posted on this thread or in a new one?

Best to post a v2 as a new top-level thread (our CI tools don't spot patch
revisions buried in reply to an existing thread).

Also, it's best to avoid top-posting on technical lists, as it makes the flow
of your email harder to follow.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org
Re: [Qemu-block] [Qemu-devel] [PATCH 1/5] nvme: PCI/e configuration from specification
Hi Daniel,

Sorry for the long round-trips, we had a busy month. We have implemented all
the changes and are waiting for a final clarification: should the new patches
be posted on this thread or in a new one?

Thanks for your time.
Gersner.

On Thu, Aug 30, 2018 at 6:45 PM Daniel Verkamp wrote:
> Hi Shimi,
>
> On Sun, Aug 26, 2018 at 2:50 PM Gersner wrote:
> >
> > Hi Daniel,
> > Thanks for taking a look. Comments are inline.
> >
> > Gersner.
> >
> > On Sun, Jul 15, 2018 at 9:21 AM Daniel Verkamp wrote:
> >>
> >> On Fri, Jun 22, 2018 at 4:22 AM, Shimi Gersner wrote:
> >> > PCI/e configuration currently does not meet the specification.
> >> >
> >> > The patch includes various configuration changes to meet the specification:
> >> > - BAR2 to return zero when read and CMD.IOSE is not set.
> >> > - Expose NVMe configuration through I/O space (optional).
> >> > - PCI Power Management v1.2.
> >> > - PCIe Function Level Reset.
> >> > - Disable QEMU's default use of PCIe Link Status (DLLLA).
> >> > - Add the missing PCIe AOC compliance flag.
> >> > - Mask PCIe End-to-End TLP as RO (unspecified by the specification).
> >> [...]
> >> >     n->num_namespaces = 1;
> >> >     n->num_queues = 64;
> >> > -   n->reg_size = pow2ceil(0x1004 + 2 * (n->num_queues + 1) * 4);
> >> > +   n->reg_size = pow2ceil(0x2000 + 2 * (n->num_queues + 1) * 4);
> >>
> >> The existing math looks OK to me (maybe already 4 bytes larger than
> >> necessary?). The controller registers should be 0x1000 bytes for the
> >> fixed register set, then 2 * 4 bytes for each doorbell pair (+ 1 is
> >> for the admin queue, and CAP.DSTRD is set to 0, so there is no extra
> >> padding between queues). The result is also rounded up to a power of
> >> two, so the result will typically be 8 KB. What is the rationale for
> >> this change?
> >
> > You are correct, this change shouldn't be here. It was made during tests
> > against the NVMe compliance suite, which failed here.
>
> If the NVMe compliance test requires it, that is probably sufficient
> reason to change it - although it would be interesting to hear their
> rationale.
>
> > The specification states that bits 0 to 13 are RO, which implicitly
> > suggests that the minimal size of this BAR should be at least 16K, as
> > this is the standard way to determine the BAR size on PCI (AFAIK). On
> > the other hand, it doesn't fit very well with the memory map laid out
> > in section 3.1 (Register Definition) of the NVMe 1.3b spec. Any idea?
>
> You're right, the MLBAR section of the NVMe spec does seem to indicate
> that up to bit 13 should be reserved/RO. This is also probably a good
> enough reason to make the change, as long as this reason is provided.
>
> > Additionally, it does look like it has an extra 4 bytes.
> >
> >> >     n->ns_size = bs_size / (uint64_t)n->num_namespaces;
> >> >
> >> >     n->namespaces = g_new0(NvmeNamespace, n->num_namespaces);
> >> > @@ -1245,6 +1295,10 @@ static void nvme_realize(PCIDevice *pci_dev, Error **errp)
> >> >     pci_register_bar(&n->parent_obj, 0,
> >> >         PCI_BASE_ADDRESS_SPACE_MEMORY | PCI_BASE_ADDRESS_MEM_TYPE_64,
> >> >         &n->iomem);
> >> > +
> >> > +// Expose the NVME memory through Address Space IO (Optional by spec)
> >> > +pci_register_bar(&n->parent_obj, 2, PCI_BASE_ADDRESS_SPACE_IO, &n->iomem);
> >>
> >> This looks like it will register the whole 4KB+ NVMe register set
> >> (n->iomem) as an I/O space BAR, but this doesn't match the spec (see
> >> NVMe 1.3c section 3.2, "Index/Data Pair registers"). The I/O BAR (if
> >> implemented) should just contain two 4-byte registers, Index and Data,
> >> that can be used to indirectly access the NVMe register set. (Just
> >> for curiosity, do you know of any software that uses this feature? It
> >> could be useful for testing the implementation.)
> >
> > Very nice catch. We indeed totally misinterpreted the specification here.
> >
> > We use the compliance suite for sanity, but it doesn't actually use this
> > feature, just verifies the configuration of the registers.
> >
> > We will go with rolling back this feature, as it is optional.
>
> This is probably the right move; I don't know of any real hardware
> devices that implement it (and therefore I doubt any software will
> require it). However, it should not be too difficult to add an
> implementation of this feature; if you are interested, take a look at
> the ahci_idp_read/ahci_idp_write functions - the NVMe index/data pair
> should be very similar.

Nice, it does seem easy to implement. We decided to go without it due to the
lack of real-world software we could test the new facility against.

> > Question - I'm having a hard time determining this from the
> > specification: as this is optional, how do device drivers determine
> > whether it is available? Is it simply the CMD.IOSE flag from PCI? And
> > if so, what is the standard way in QEMU to disable the flag completely?
>
> Based on the wording in NVMe 1.3 section 3.2, it seems like the only
> indication of support for Index/Data Pair is whether BAR2 is filled out.
> I believe PCI defines unused BARs to be all 0s, so software should be
> able to use this to determine whether the feature is provided by a
> particular NVMe PCIe device, and removing the pci_register_bar() call
> should be enough to accomplish this from the QEMU side.
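To make that detection rule concrete, a driver-side probe could look roughly
like this (a sketch only; pci_config_read32() stands in for whatever
config-space accessor the host environment provides):

    /* Probe BAR2 (config offset PCI_BASE_ADDRESS_2, 0x18): per the PCI
     * spec, an unimplemented BAR is hardwired to all 0s, so a nonzero
     * value here means the optional Index/Data Pair I/O BAR is present
     * (an implemented I/O BAR always has bit 0 set). */
    uint32_t bar2 = pci_config_read32(dev, PCI_BASE_ADDRESS_2);
    bool has_idp = (bar2 != 0);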
Re: [Qemu-block] [Qemu-devel] [PATCH 1/5] nvme: PCI/e configuration from specification
Hi Shimi,

On Sun, Aug 26, 2018 at 2:50 PM Gersner wrote:
>
> Hi Daniel,
> Thanks for taking a look. Comments are inline.
>
> Gersner.
>
> On Sun, Jul 15, 2018 at 9:21 AM Daniel Verkamp wrote:
>>
>> On Fri, Jun 22, 2018 at 4:22 AM, Shimi Gersner wrote:
>> > PCI/e configuration currently does not meet the specification.
>> >
>> > The patch includes various configuration changes to meet the specification:
>> > - BAR2 to return zero when read and CMD.IOSE is not set.
>> > - Expose NVMe configuration through I/O space (optional).
>> > - PCI Power Management v1.2.
>> > - PCIe Function Level Reset.
>> > - Disable QEMU's default use of PCIe Link Status (DLLLA).
>> > - Add the missing PCIe AOC compliance flag.
>> > - Mask PCIe End-to-End TLP as RO (unspecified by the specification).
>> [...]
>> >     n->num_namespaces = 1;
>> >     n->num_queues = 64;
>> > -   n->reg_size = pow2ceil(0x1004 + 2 * (n->num_queues + 1) * 4);
>> > +   n->reg_size = pow2ceil(0x2000 + 2 * (n->num_queues + 1) * 4);
>>
>> The existing math looks OK to me (maybe already 4 bytes larger than
>> necessary?). The controller registers should be 0x1000 bytes for the
>> fixed register set, then 2 * 4 bytes for each doorbell pair (+ 1 is
>> for the admin queue, and CAP.DSTRD is set to 0, so there is no extra
>> padding between queues). The result is also rounded up to a power of
>> two, so the result will typically be 8 KB. What is the rationale for
>> this change?
>
> You are correct, this change shouldn't be here. It was made during tests
> against the NVMe compliance suite, which failed here.

If the NVMe compliance test requires it, that is probably sufficient
reason to change it - although it would be interesting to hear their
rationale.

> The specification states that bits 0 to 13 are RO, which implicitly
> suggests that the minimal size of this BAR should be at least 16K, as
> this is the standard way to determine the BAR size on PCI (AFAIK). On
> the other hand, it doesn't fit very well with the memory map laid out
> in section 3.1 (Register Definition) of the NVMe 1.3b spec. Any idea?

You're right, the MLBAR section of the NVMe spec does seem to indicate
that up to bit 13 should be reserved/RO. This is also probably a good
enough reason to make the change, as long as this reason is provided.

> Additionally, it does look like it has an extra 4 bytes.
>
>> >     n->ns_size = bs_size / (uint64_t)n->num_namespaces;
>> >
>> >     n->namespaces = g_new0(NvmeNamespace, n->num_namespaces);
>> > @@ -1245,6 +1295,10 @@ static void nvme_realize(PCIDevice *pci_dev, Error **errp)
>> >     pci_register_bar(&n->parent_obj, 0,
>> >         PCI_BASE_ADDRESS_SPACE_MEMORY | PCI_BASE_ADDRESS_MEM_TYPE_64,
>> >         &n->iomem);
>> > +
>> > +// Expose the NVME memory through Address Space IO (Optional by spec)
>> > +pci_register_bar(&n->parent_obj, 2, PCI_BASE_ADDRESS_SPACE_IO, &n->iomem);
>>
>> This looks like it will register the whole 4KB+ NVMe register set
>> (n->iomem) as an I/O space BAR, but this doesn't match the spec (see
>> NVMe 1.3c section 3.2, "Index/Data Pair registers"). The I/O BAR (if
>> implemented) should just contain two 4-byte registers, Index and Data,
>> that can be used to indirectly access the NVMe register set. (Just
>> for curiosity, do you know of any software that uses this feature? It
>> could be useful for testing the implementation.)
>
> Very nice catch. We indeed totally misinterpreted the specification here.
>
> We use the compliance suite for sanity, but it doesn't actually use this
> feature, just verifies the configuration of the registers.
>
> We will go with rolling back this feature, as it is optional.

This is probably the right move; I don't know of any real hardware
devices that implement it (and therefore I doubt any software will
require it). However, it should not be too difficult to add an
implementation of this feature; if you are interested, take a look at
the ahci_idp_read/ahci_idp_write functions - the NVMe index/data pair
should be very similar.

> Question - I'm having a hard time determining this from the
> specification: as this is optional, how do device drivers determine
> whether it is available? Is it simply the CMD.IOSE flag from PCI? And
> if so, what is the standard way in QEMU to disable the flag completely?

Based on the wording in NVMe 1.3 section 3.2, it seems like the only
indication of support for Index/Data Pair is whether BAR2 is filled out.
I believe PCI defines unused BARs to be all 0s, so software should be
able to use this to determine whether the feature is provided by a
particular NVMe PCIe device, and removing the pci_register_bar() call
should be enough to accomplish this from the QEMU side.

Thanks,
-- Daniel
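As a reference for that suggestion, here is a minimal sketch of what an
Index/Data Pair implementation might look like, modeled loosely on
ahci_idp_read()/ahci_idp_write() in hw/ide/ahci.c. The idp_index field and
the reuse of the existing nvme_mmio_read()/nvme_mmio_write() handlers are
illustrative assumptions, not code from this patch:

/* Sketch only: the I/O BAR holds two 4-byte registers.  Index (offset 0)
 * selects an offset into the controller register set; Data (offset 4)
 * forwards the access to the regular MMIO handlers at that offset.
 * n->idp_index would be a new NvmeCtrl field. */
static uint64_t nvme_idp_read(void *opaque, hwaddr addr, unsigned size)
{
    NvmeCtrl *n = opaque;

    if (addr == 0) {
        return n->idp_index;
    }
    return nvme_mmio_read(n, n->idp_index, 4);
}

static void nvme_idp_write(void *opaque, hwaddr addr, uint64_t data,
                           unsigned size)
{
    NvmeCtrl *n = opaque;

    if (addr == 0) {
        n->idp_index = data & 0xfff;   /* offset into the register set */
    } else {
        nvme_mmio_write(n, n->idp_index, data, 4);
    }
}

Hooked up through a small MemoryRegionOps and registered on BAR2 with
PCI_BASE_ADDRESS_SPACE_IO, this would keep the I/O BAR at the spec's two
4-byte registers instead of mirroring the whole register set.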
Re: [Qemu-block] [Qemu-devel] [PATCH 1/5] nvme: PCI/e configuration from specification
Hi Daniel,
Thanks for taking a look. Comments are inline.

Gersner.

On Sun, Jul 15, 2018 at 9:21 AM Daniel Verkamp wrote:
> On Fri, Jun 22, 2018 at 4:22 AM, Shimi Gersner wrote:
> > PCI/e configuration currently does not meet the specification.
> >
> > The patch includes various configuration changes to meet the specification:
> > - BAR2 to return zero when read and CMD.IOSE is not set.
> > - Expose NVMe configuration through I/O space (optional).
> > - PCI Power Management v1.2.
> > - PCIe Function Level Reset.
> > - Disable QEMU's default use of PCIe Link Status (DLLLA).
> > - Add the missing PCIe AOC compliance flag.
> > - Mask PCIe End-to-End TLP as RO (unspecified by the specification).
> [...]
> >     n->num_namespaces = 1;
> >     n->num_queues = 64;
> > -   n->reg_size = pow2ceil(0x1004 + 2 * (n->num_queues + 1) * 4);
> > +   n->reg_size = pow2ceil(0x2000 + 2 * (n->num_queues + 1) * 4);
>
> The existing math looks OK to me (maybe already 4 bytes larger than
> necessary?). The controller registers should be 0x1000 bytes for the
> fixed register set, then 2 * 4 bytes for each doorbell pair (+ 1 is
> for the admin queue, and CAP.DSTRD is set to 0, so there is no extra
> padding between queues). The result is also rounded up to a power of
> two, so the result will typically be 8 KB. What is the rationale for
> this change?

You are correct, this change shouldn't be here. It was made during tests
against the NVMe compliance suite, which failed here.

The specification states that bits 0 to 13 are RO, which implicitly
suggests that the minimal size of this BAR should be at least 16K, as
this is the standard way to determine the BAR size on PCI (AFAIK). On
the other hand, it doesn't fit very well with the memory map laid out in
section 3.1 (Register Definition) of the NVMe 1.3b spec. Any idea?

Additionally, it does look like it has an extra 4 bytes.

> >     n->ns_size = bs_size / (uint64_t)n->num_namespaces;
> >
> >     n->namespaces = g_new0(NvmeNamespace, n->num_namespaces);
> > @@ -1245,6 +1295,10 @@ static void nvme_realize(PCIDevice *pci_dev, Error **errp)
> >     pci_register_bar(&n->parent_obj, 0,
> >         PCI_BASE_ADDRESS_SPACE_MEMORY | PCI_BASE_ADDRESS_MEM_TYPE_64,
> >         &n->iomem);
> > +
> > +// Expose the NVME memory through Address Space IO (Optional by spec)
> > +pci_register_bar(&n->parent_obj, 2, PCI_BASE_ADDRESS_SPACE_IO, &n->iomem);
>
> This looks like it will register the whole 4KB+ NVMe register set
> (n->iomem) as an I/O space BAR, but this doesn't match the spec (see
> NVMe 1.3c section 3.2, "Index/Data Pair registers"). The I/O BAR (if
> implemented) should just contain two 4-byte registers, Index and Data,
> that can be used to indirectly access the NVMe register set. (Just
> for curiosity, do you know of any software that uses this feature? It
> could be useful for testing the implementation.)

Very nice catch. We indeed totally misinterpreted the specification here.

We use the compliance suite for sanity, but it doesn't actually use this
feature, just verifies the configuration of the registers.

We will go with rolling back this feature, as it is optional.

Question - I'm having a hard time determining this from the
specification: as this is optional, how do device drivers determine
whether it is available? Is it simply the CMD.IOSE flag from PCI? And if
so, what is the standard way in QEMU to disable the flag completely?

> Other minor nitpicks:
> - Comment style is inconsistent with the rest of the file (// vs /* */)
> - PCIe and NVMe are incorrectly capitalized a few times in comments

Thanks.

> Thanks,
> -- Daniel
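For context, the "standard way to determine the BAR size" mentioned above is
the usual PCI sizing handshake. Roughly, in driver-side C (a sketch with
hypothetical pci_config_* helpers, not code from the patch):

/* Size a memory BAR: write all 1s, read back; the RO low bits stay 0 and
 * encode the size.  Bits 0-3 are the memory-type bits, so if bits 4-13
 * also read back as 0, the region is at least 1 << 14 = 16K. */
pci_config_write32(dev, bar_offset, 0xffffffff);
uint32_t v = pci_config_read32(dev, bar_offset);
uint32_t size = ~(v & ~0xfu) + 1;   /* e.g. v = 0xffffc000 -> size = 0x4000 */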
Re: [Qemu-block] [Qemu-devel] [PATCH 1/5] nvme: PCI/e configuration from specification
On Fri, Jun 22, 2018 at 4:22 AM, Shimi Gersner wrote:
> PCI/e configuration currently does not meet the specification.
>
> The patch includes various configuration changes to meet the specification:
> - BAR2 to return zero when read and CMD.IOSE is not set.
> - Expose NVMe configuration through I/O space (optional).
> - PCI Power Management v1.2.
> - PCIe Function Level Reset.
> - Disable QEMU's default use of PCIe Link Status (DLLLA).
> - Add the missing PCIe AOC compliance flag.
> - Mask PCIe End-to-End TLP as RO (unspecified by the specification).
[...]
>     n->num_namespaces = 1;
>     n->num_queues = 64;
> -   n->reg_size = pow2ceil(0x1004 + 2 * (n->num_queues + 1) * 4);
> +   n->reg_size = pow2ceil(0x2000 + 2 * (n->num_queues + 1) * 4);

The existing math looks OK to me (maybe already 4 bytes larger than
necessary?). The controller registers should be 0x1000 bytes for the
fixed register set, then 2 * 4 bytes for each doorbell pair (+ 1 is
for the admin queue, and CAP.DSTRD is set to 0, so there is no extra
padding between queues). The result is also rounded up to a power of
two, so the result will typically be 8 KB. What is the rationale for
this change?

>     n->ns_size = bs_size / (uint64_t)n->num_namespaces;
>
>     n->namespaces = g_new0(NvmeNamespace, n->num_namespaces);
> @@ -1245,6 +1295,10 @@ static void nvme_realize(PCIDevice *pci_dev, Error **errp)
>     pci_register_bar(&n->parent_obj, 0,
>         PCI_BASE_ADDRESS_SPACE_MEMORY | PCI_BASE_ADDRESS_MEM_TYPE_64,
>         &n->iomem);
> +
> +// Expose the NVME memory through Address Space IO (Optional by spec)
> +pci_register_bar(&n->parent_obj, 2, PCI_BASE_ADDRESS_SPACE_IO, &n->iomem);

This looks like it will register the whole 4KB+ NVMe register set
(n->iomem) as an I/O space BAR, but this doesn't match the spec (see
NVMe 1.3c section 3.2, "Index/Data Pair registers"). The I/O BAR (if
implemented) should just contain two 4-byte registers, Index and Data,
that can be used to indirectly access the NVMe register set. (Just
for curiosity, do you know of any software that uses this feature? It
could be useful for testing the implementation.)

Other minor nitpicks:
- Comment style is inconsistent with the rest of the file (// vs /* */)
- PCIe and NVMe are incorrectly capitalized a few times in comments

Thanks,
-- Daniel
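To spell out the sizing arithmetic above, here is a worked example with the
device defaults, using QEMU's pow2ceil() from qemu/host-utils.h (illustrative
only, not code from the patch):

/* Fixed register set is 0x1000 bytes; with CAP.DSTRD == 0, each queue
 * pair adds a 4-byte SQ tail doorbell and a 4-byte CQ head doorbell. */
unsigned num_queues = 64;                         /* device default       */
uint64_t doorbells  = 2 * (num_queues + 1) * 4;   /* 0x208 (+1 for admin) */
uint64_t reg_size   = pow2ceil(0x1000 + doorbells);
/* 0x1000 + 0x208 = 0x1208, rounded up to 0x2000 = 8 KB -- which is why
 * the original 0x1004 base looks 4 bytes larger than necessary. */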