Re: [Qemu-block] [Qemu-devel] [PATCH 1/5] nvme: PCI/e configuration from specification

2018-09-12 Thread Eric Blake

On 9/12/18 2:53 PM, Gersner wrote:

Hi Daniel,

Sorry for the long round-trips, we had a busy month. We have implemented
all the changes. Waiting for a final clarification.

Should the new patches be posted on this thread or a new one?


Best to post a v2 as a new top-level thread (our CI tools don't spot 
patch revisions buried in reply to an existing thread).


Also, it's best to avoid top-posting on technical lists, as it's harder 
to follow the flow of your email.



--
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org



Re: [Qemu-block] [Qemu-devel] [PATCH 1/5] nvme: PCI/e configuration from specification

2018-09-12 Thread Gersner
Hi Daniel,

Sorry for the long round-trips, we had a busy month. We have implemented
all the changes. Waiting for a final clarification.

Should the new patches be posted on this thread or a new one?

Thanks for your time.

Gersner.

On Thu, Aug 30, 2018 at 6:45 PM Daniel Verkamp  wrote:

> Hi Shimi,
>
> On Sun, Aug 26, 2018 at 2:50 PM Gersner  wrote:
> >
> > Hi Daniel,
> > Thanks for taking a look. Comments are inline.
> >
> > Gersner.
> >
> > On Sun, Jul 15, 2018 at 9:21 AM Daniel Verkamp  wrote:
> >>
> >> On Fri, Jun 22, 2018 at 4:22 AM, Shimi Gersner  wrote:
> >> > PCI/e configuration currently does not meet the specification.
> >> >
> >> > Patch includes various configuration changes to support specifications
> >> > - BAR2 to return zero when read and CMD.IOSE is not set.
> >> > - Expose NVME configuration through IO space (Optional).
> >> > - PCI Power Management v1.2.
> >> > - PCIe Function Level Reset.
> >> > - Disable QEMUs default use of PCIe Link Status (DLLLA).
> >> > - PCIe missing AOC compliance flag.
> >> > - Mask PCIe End-to-End TLP as RO (Unspecified by specification).
> >> [...]
> >> >  n->num_namespaces = 1;
> >> >  n->num_queues = 64;
> >> > -n->reg_size = pow2ceil(0x1004 + 2 * (n->num_queues + 1) * 4);
> >> > +n->reg_size = pow2ceil(0x2000 + 2 * (n->num_queues + 1) * 4);
> >>
> >> The existing math looks OK to me (maybe already 4 bytes larger than
> >> necessary?). The controller registers should be 0x1000 bytes for the
> >> fixed register set, then 2 * 4 bytes for each doorbell pair (+ 1 is
> >> for the admin queue, and CAP.DSTRD is set to 0, so there is no extra
> >> padding between queues). The result is also rounded up to a power of
> >> two, so the result will typically be 8 KB.  What is the rationale for
> >> this change?
> >
> > You are correct, this change shouldn't be here. It was made during tests
> > against the NVMe compliance suite, which failed here.
>
> If the NVMe compliance test requires it, that is probably sufficient
> reason to change it - although it would be interesting to hear their
> rationale.
>
> > The specification states that bits 0 to 13 are RO, which implicitly
> > suggests that the minimal size of this BAR should be at least 16K, as
> > this is a standard way to determine the BAR size on PCI (AFAIK). On the
> > other hand, it doesn't fit very well with the memory map laid out in
> > section 3.1 Register Definition of the 1.3b NVMe spec. Any idea?
>
> You're right, the MLBAR section of the NVMe spec does seem to indicate
> that up to bit 13 should be reserved/RO.  This is also probably a good
> enough reason to make the change, as long as this reason is provided.
>
> >
> > Additionally it does look like it has an extra 4 bytes.
> >
> >>
> >>
> >> >  n->ns_size = bs_size / (uint64_t)n->num_namespaces;
> >> >
> >> >  n->namespaces = g_new0(NvmeNamespace, n->num_namespaces);
> >> > @@ -1245,6 +1295,10 @@ static void nvme_realize(PCIDevice *pci_dev, Error **errp)
> >> >  pci_register_bar(&n->parent_obj, 0,
> >> >  PCI_BASE_ADDRESS_SPACE_MEMORY | PCI_BASE_ADDRESS_MEM_TYPE_64,
> >> >  &n->iomem);
> >> > +
> >> > +// Expose the NVME memory through Address Space IO (Optional by spec)
> >> > +pci_register_bar(&n->parent_obj, 2, PCI_BASE_ADDRESS_SPACE_IO, &n->iomem);
> >>
> >> This looks like it will register the whole 4KB+ NVMe register set
> >> (n->iomem) as an I/O space BAR, but this doesn't match the spec (see
> >> NVMe 1.3c section 3.2, "Index/Data Pair registers"). The I/O BAR (if
> >> implemented) should just contain two 4-byte registers, Index and Data,
> >> that can be used to indirectly access the NVMe register set.  (Just
> >> for curiosity, do you know of any software that uses this feature? It
> >> could be useful for testing the implementation.)
> >
> > Very nice catch. We indeed totally misinterpreted the specification here.
> >
> > We use the compliance suite for sanity, but it doesn't actually use this
> > feature, just verifies the configuration of the registers.
> >
> > We will roll back this feature, as it is optional.
>
> This is probably the right move; I don't know of any real hardware
> devices that implement it (and therefore I doubt any software will
> require it).  However, it should not be too difficult to add an
> implementation of this feature; if you are interested, take a look at
> the ahci_idp_read/ahci_idp_write functions - the NVMe index/data pair
> should be very similar.
>
Nice, it does seem easy to implement. We decided to go without it due to the
lack of real-world software we could test the new facility against.


>
> > Question - I'm having a hard time determining this from the specification -
> > as this is optional, how do device drivers determine if it is available? Is
> > it simply the CMD.IOSE flag from PCI config space? And if so, what is
> > the standard way in QEMU to disable the flag completely?
>
> Based on the wording in NVMe 1.3 section 3.2, it seems like the only
> indication of support for Index/Data Pair is whether BAR2 is filled
> out.  I believe PCI defines unused BARs to be all 0s, so software
> should be able to use this to determine whether the feature is
> provided by a particular NVMe PCIe device, and removing the
> pci_register_bar() call should be enough to accomplish this from the
> QEMU side.

Re: [Qemu-block] [Qemu-devel] [PATCH 1/5] nvme: PCI/e configuration from specification

2018-08-30 Thread Daniel Verkamp
Hi Shimi,

On Sun, Aug 26, 2018 at 2:50 PM Gersner  wrote:
>
> Hi Daniel,
> Thanks for taking a look. Comments are inline.
>
> Gersner.
>
> On Sun, Jul 15, 2018 at 9:21 AM Daniel Verkamp  wrote:
>>
>> On Fri, Jun 22, 2018 at 4:22 AM, Shimi Gersner  wrote:
>> > PCI/e configuration currently does not meet the specification.
>> >
>> > Patch includes various configuration changes to support specifications
>> > - BAR2 to return zero when read and CMD.IOSE is not set.
>> > - Expose NVME configuration through IO space (Optional).
>> > - PCI Power Management v1.2.
>> > - PCIe Function Level Reset.
>> > - Disable QEMUs default use of PCIe Link Status (DLLLA).
>> > - PCIe missing AOC compliance flag.
>> > - Mask PCIe End-to-End TLP as RO (Unspecified by specification).
>> [...]
>> >  n->num_namespaces = 1;
>> >  n->num_queues = 64;
>> > -n->reg_size = pow2ceil(0x1004 + 2 * (n->num_queues + 1) * 4);
>> > +n->reg_size = pow2ceil(0x2000 + 2 * (n->num_queues + 1) * 4);
>>
>> The existing math looks OK to me (maybe already 4 bytes larger than
>> necessary?). The controller registers should be 0x1000 bytes for the
>> fixed register set, then 2 * 4 bytes for each doorbell pair (+ 1 is
>> for the admin queue, and CAP.DSTRD is set to 0, so there is no extra
>> padding between queues). The result is also rounded up to a power of
>> two, so the result will typically be 8 KB.  What is the rationale for
>> this change?
>
> You are correct, this change shouldn't be here. It was made during tests
> against the NVMe compliance suite, which failed here.

If the NVMe compliance test requires it, that is probably sufficient
reason to change it - although it would be interesting to hear their
rationale.

> The specification states that bits 0 to 13 are RO, which implicitly suggests
> that the minimal size of this BAR should be at least 16K, as this is a standard
> way to determine the BAR size on PCI (AFAIK). On the other hand, it doesn't
> fit very well with the memory map laid out in section 3.1 Register Definition
> of the 1.3b NVMe spec. Any idea?

You're right, the MLBAR section of the NVMe spec does seem to indicate
that up to bit 13 should be reserved/RO.  This is also probably a good
enough reason to make the change, as long as this reason is provided.

>
> Additionally it does look like it has an extra 4 bytes.
>
>>
>>
>> >  n->ns_size = bs_size / (uint64_t)n->num_namespaces;
>> >
>> >  n->namespaces = g_new0(NvmeNamespace, n->num_namespaces);
>> > @@ -1245,6 +1295,10 @@ static void nvme_realize(PCIDevice *pci_dev, Error **errp)
>> >  pci_register_bar(&n->parent_obj, 0,
>> >  PCI_BASE_ADDRESS_SPACE_MEMORY | PCI_BASE_ADDRESS_MEM_TYPE_64,
>> >  &n->iomem);
>> > +
>> > +// Expose the NVME memory through Address Space IO (Optional by spec)
>> > +pci_register_bar(&n->parent_obj, 2, PCI_BASE_ADDRESS_SPACE_IO, &n->iomem);
>>
>> This looks like it will register the whole 4KB+ NVMe register set
>> (n->iomem) as an I/O space BAR, but this doesn't match the spec (see
>> NVMe 1.3c section 3.2, "Index/Data Pair registers"). The I/O BAR (if
>> implemented) should just contain two 4-byte registers, Index and Data,
>> that can be used to indirectly access the NVMe register set.  (Just
>> for curiosity, do you know of any software that uses this feature? It
>> could be useful for testing the implementation.)
>
> Very nice catch. We indeed totally misinterpreted the specification here.
>
> We use the compliance suite for sanity, but it doesn't actually use this
> feature, just verifies the configuration of the registers.
>
> We will roll back this feature, as it is optional.

This is probably the right move; I don't know of any real hardware
devices that implement it (and therefore I doubt any software will
require it).  However, it should not be too difficult to add an
implementation of this feature; if you are interested, take a look at
the ahci_idp_read/ahci_idp_write functions - the NVMe index/data pair
should be very similar.

> Question - I'm having a hard time determining this from the specification -
> as this is optional, how do device drivers determine if it is available? Is
> it simply the CMD.IOSE flag from PCI config space? And if so, what is
> the standard way in QEMU to disable the flag completely?

Based on the wording in NVMe 1.3 section 3.2, it seems like the only
indication of support for Index/Data Pair is whether BAR2 is filled
out.  I believe PCI defines unused BARs to be all 0s, so software
should be able to use this to determine whether the feature is
provided by a particular NVMe PCIe device, and removing the
pci_register_bar() call should be enough to accomplish this from the
QEMU side.

Thanks,
-- Daniel



Re: [Qemu-block] [Qemu-devel] [PATCH 1/5] nvme: PCI/e configuration from specification

2018-08-26 Thread Gersner
Hi Daniel,
Thanks for taking a look. Comments are inline.

Gersner.

On Sun, Jul 15, 2018 at 9:21 AM Daniel Verkamp  wrote:

> On Fri, Jun 22, 2018 at 4:22 AM, Shimi Gersner  wrote:
> > PCI/e configuration currently does not meet the specification.
> >
> > Patch includes various configuration changes to support specifications
> > - BAR2 to return zero when read and CMD.IOSE is not set.
> > - Expose NVME configuration through IO space (Optional).
> > - PCI Power Management v1.2.
> > - PCIe Function Level Reset.
> > - Disable QEMUs default use of PCIe Link Status (DLLLA).
> > - PCIe missing AOC compliance flag.
> > - Mask PCIe End-to-End TLP as RO (Unspecified by specification).
> [...]
> >  n->num_namespaces = 1;
> >  n->num_queues = 64;
> > -n->reg_size = pow2ceil(0x1004 + 2 * (n->num_queues + 1) * 4);
> > +n->reg_size = pow2ceil(0x2000 + 2 * (n->num_queues + 1) * 4);
>
> The existing math looks OK to me (maybe already 4 bytes larger than
> necessary?). The controller registers should be 0x1000 bytes for the
> fixed register set, then 2 * 4 bytes for each doorbell pair (+ 1 is
> for the admin queue, and CAP.DSTRD is set to 0, so there is no extra
> padding between queues). The result is also rounded up to a power of
> two, so the result will typically be 8 KB.  What is the rationale for
> this change?
>
You are correct, this change shouldn't be here. It was made during tests
against the NVMe compliance suite, which failed here.

The specification states that bits 0 to 13 are RO, which implicitly suggests
that the minimal size of this BAR should be at least 16K, as this is a
standard way to determine the BAR size on PCI (AFAIK). On the other hand, it
doesn't fit very well with the memory map laid out in section 3.1 Register
Definition of the 1.3b NVMe spec. Any idea?

Additionally it does look like it has an extra 4 bytes.


>
> >  n->ns_size = bs_size / (uint64_t)n->num_namespaces;
> >
> >  n->namespaces = g_new0(NvmeNamespace, n->num_namespaces);
> > @@ -1245,6 +1295,10 @@ static void nvme_realize(PCIDevice *pci_dev, Error **errp)
> >  pci_register_bar(&n->parent_obj, 0,
> >  PCI_BASE_ADDRESS_SPACE_MEMORY | PCI_BASE_ADDRESS_MEM_TYPE_64,
> >  &n->iomem);
> > +
> > +// Expose the NVME memory through Address Space IO (Optional by spec)
> > +pci_register_bar(&n->parent_obj, 2, PCI_BASE_ADDRESS_SPACE_IO, &n->iomem);
>
> This looks like it will register the whole 4KB+ NVMe register set
> (n->iomem) as an I/O space BAR, but this doesn't match the spec (see
> NVMe 1.3c section 3.2, "Index/Data Pair registers"). The I/O BAR (if
> implemented) should just contain two 4-byte registers, Index and Data,
> that can be used to indirectly access the NVMe register set.  (Just
> for curiosity, do you know of any software that uses this feature? It
> could be useful for testing the implementation.)
>
Very nice catch. We indeed totally misinterpreted the specification here.

We use the compliance suite for sanity, but it doesn't actually use this
feature, just verifies the configuration of the registers.

We will roll back this feature, as it is optional.

Question - I'm having a hard time determining this from the specification -
as this is optional, how do device drivers determine if it is available? Is
it simply the CMD.IOSE flag from PCI config space? And if so, what is
the standard way in QEMU to disable the flag completely?


>
> Other minor nitpicks:
> - Comment style is inconsistent with the rest of the file (// vs /* */)
> - PCIe and NVMe are incorrectly capitalized a few times in comments
>

Thanks.


>
> Thanks,
> -- Daniel
>


Re: [Qemu-block] [Qemu-devel] [PATCH 1/5] nvme: PCI/e configuration from specification

2018-07-15 Thread Daniel Verkamp
On Fri, Jun 22, 2018 at 4:22 AM, Shimi Gersner  wrote:
> PCI/e configuration currently does not meet the specification.
>
> Patch includes various configuration changes to support specifications
> - BAR2 to return zero when read and CMD.IOSE is not set.
> - Expose NVME configuration through IO space (Optional).
> - PCI Power Management v1.2.
> - PCIe Function Level Reset.
> - Disable QEMUs default use of PCIe Link Status (DLLLA).
> - PCIe missing AOC compliance flag.
> - Mask PCIe End-to-End TLP as RO (Unspecified by specification).
[...]
>  n->num_namespaces = 1;
>  n->num_queues = 64;
> -n->reg_size = pow2ceil(0x1004 + 2 * (n->num_queues + 1) * 4);
> +n->reg_size = pow2ceil(0x2000 + 2 * (n->num_queues + 1) * 4);

The existing math looks OK to me (maybe already 4 bytes larger than
necessary?). The controller registers should be 0x1000 bytes for the
fixed register set, then 2 * 4 bytes for each doorbell pair (+ 1 is
for the admin queue, and CAP.DSTRD is set to 0, so there is no extra
padding between queues). The result is also rounded up to a power of
two, so the result will typically be 8 KB.  What is the rationale for
this change?

>  n->ns_size = bs_size / (uint64_t)n->num_namespaces;
>
>  n->namespaces = g_new0(NvmeNamespace, n->num_namespaces);
> @@ -1245,6 +1295,10 @@ static void nvme_realize(PCIDevice *pci_dev, Error **errp)
>  pci_register_bar(&n->parent_obj, 0,
>  PCI_BASE_ADDRESS_SPACE_MEMORY | PCI_BASE_ADDRESS_MEM_TYPE_64,
>  &n->iomem);
> +
> +// Expose the NVME memory through Address Space IO (Optional by spec)
> +pci_register_bar(&n->parent_obj, 2, PCI_BASE_ADDRESS_SPACE_IO, &n->iomem);

This looks like it will register the whole 4KB+ NVMe register set
(n->iomem) as an I/O space BAR, but this doesn't match the spec (see
NVMe 1.3c section 3.2, "Index/Data Pair registers"). The I/O BAR (if
implemented) should just contain two 4-byte registers, Index and Data,
that can be used to indirectly access the NVMe register set.  (Just
for curiosity, do you know of any software that uses this feature? It
could be useful for testing the implementation.)

Other minor nitpicks:
- Comment style is inconsistent with the rest of the file (// vs /* */)
- PCIe and NVMe are incorrectly capitalized a few times in comments

Thanks,
-- Daniel