Re: [PATCH v4 2/2] nvme: allow cmb and pmr to be enabled on same device

2020-07-22 Thread Andrzej Jakowski
On 7/22/20 10:21 AM, Klaus Jensen wrote:
> On Jul 22 10:00, Andrzej Jakowski wrote:
>> On 7/22/20 12:43 AM, Klaus Jensen wrote:
>>> @keith, please see below - can you comment on the Linux kernel 2 MB
>>> boundary requirement for the CMB? Or should we hail Stephen (or Logan
>>> maybe) since this seems to be related to p2pdma?
>>>
>>> On Jul 21 14:54, Andrzej Jakowski wrote:
 On 7/15/20 1:06 AM, Klaus Jensen wrote:
> Hi Andrzej,
>
> I've not been ignoring this, but sorry for not following up earlier.
>
> I'm hesitent to merge anything that very obviously breaks an OS that we
> know is used a lot to this using this device. Also because the issue has
> not been analyzed well enough to actually know if this is a QEMU or
> kernel issue.

 Hi Klaus,

 Thx for your response! I understand your hesitance on merging stuff that
 obviously breaks guest OS. 

>
> Now, as far as I can test, having the MSI-X vector table and PBA in BAR
> 0, PMR in BAR 2 and CMB in BAR 4 seems to make everyone happy
> (irregardless of IOMMU on/off).
>
> Later, when the issue is better understood, we can add options to set
> offsets, BIRs etc.
>
> The patch below replaces your "[PATCH v4 2/2] nvme: allow cmb and pmr to
> be enabled" (but still requires "[PATCH v4 1/2] ...") and applies to
> git://git.infradead.org/qemu-nvme.git nvme-next branch.
>
> Can you reproduce the issues with that patch? I can't on a stock Arch
> Linux 5.7.5-arch1-1 kernel.

 While I'm happy that approach with MSIX and PBA in BAR0 works fine, I
 feel that investigation part why it works while mine doesn't is
 missing. It looks to me that both patches are basically following same 
 approach: create memory subregion and overlay on top of other memory
 region. Why one works and the other doesn't then?

 Having in mind that, I have recently focused on understanding problem.
 I observed that when guest assigns address to BAR4, addr field in
 nvme-bar4 memory region gets populated, but it doesn't get automatically
 populated in ctrl_mem (cmb) memory subregion, so later when 
 nvme_addr_is_cmb() 
 is called address check works incorrectly and as a consequence vmm does 
 dma 
 read instead of memcpy.
 I created a patch that sets correct address on ctrl_mem subregion and 
 guest 
 OS boots up correctly.

 When I looked into pci and memory region code I noticed that indeed address
 is only assigned to top level memory region but not to contained 
 subregions.
 I think that because in your approach cmb grabs whole bar exclusively it 
 works
 fine.

 Here is my question (perhaps pci folks can help answer :)): if we consider 
 memory region overlapping for pci devices as valid use case should pci 
 code on configuration cycles walk through all contained subregion and
 update addr field accordingly?

 Thx!

>>>
>>> Hi Andrzej,
>>>
>>> Thanks for looking into this. I think your analysis helped me nail this.
>>> The problem is that we added the use of a subregion and have some
>>> assumptions that no longer hold.
>>>
>>> nvme_addr_is_cmb() assumes that n->ctrl_mem.addr is an absolute address.
>>> But when the memory region is a subregion, addr holds an offset into the
>>> parent container instead. Thus, changing all occurances of
>>> n->ctrl_mem.addr to (n->bar0.addr + n->ctrl_mem.addr) fixes the issue
>>> (this is required in nvme_addr_is_cmb and nvme_map_prp). I patched that
>>> in your original patch[1]. The reason my version worked is because there
>>> was no subregion involved for the CMB, so the existing address
>>> validation calculations were still correct.
>>
>> I'm a little bit concerned with this approach:
>> (n->bar0.addr + n->ctrl_mem.addr) and hoping to have some debate. Let me 
>> describe my understanding of the problem.
> 
> Oh. In the context of your patch I meant bar4 of course, but anyway.
> 
>> It looks to me that addr field sometimes contains *absolute* address (when 
>> no 
>> hierarchy is used) and other times it contains *relative* address (when
>> hierarchy is created). From my perspective use of this field is inconsistent
>> and thus error-prone.  
>> Because of that I think that doing n->bar0.addr + n->ctrl_mem.addr doesn't
>> solve root problem and is still prone to the same problem if in the future
>> we potentially build even more complicated hierarchy.
>> I think that we could solve it by introducing helper function like
>>
>> hwaddr memory_region_get_abs_addr(MemoryRegion *mr) 
>>
>> to retrieve absolute address and in the documentation indicate that addr 
>> field
>> can be relative or absolute and it is recommended to use above function to 
>> retrieve absolute address.
>> What do you think?
>>
> 
> I'm all for a helper - I was not gonna cheer for the quick'n'dirty fix I
> did just to convince myself that this was the issue ;)

Re: [PATCH v4 2/2] nvme: allow cmb and pmr to be enabled on same device

2020-07-22 Thread Klaus Jensen
On Jul 22 10:00, Andrzej Jakowski wrote:
> On 7/22/20 12:43 AM, Klaus Jensen wrote:
> > @keith, please see below - can you comment on the Linux kernel 2 MB
> > boundary requirement for the CMB? Or should we hail Stephen (or Logan
> > maybe) since this seems to be related to p2pdma?
> > 
> > On Jul 21 14:54, Andrzej Jakowski wrote:
> >> On 7/15/20 1:06 AM, Klaus Jensen wrote:
> >>> Hi Andrzej,
> >>>
> >>> I've not been ignoring this, but sorry for not following up earlier.
> >>>
> >>> I'm hesitent to merge anything that very obviously breaks an OS that we
> >>> know is used a lot to this using this device. Also because the issue has
> >>> not been analyzed well enough to actually know if this is a QEMU or
> >>> kernel issue.
> >>
> >> Hi Klaus,
> >>
> >> Thx for your response! I understand your hesitance on merging stuff that
> >> obviously breaks guest OS. 
> >>
> >>>
> >>> Now, as far as I can test, having the MSI-X vector table and PBA in BAR
> >>> 0, PMR in BAR 2 and CMB in BAR 4 seems to make everyone happy
> >>> (irregardless of IOMMU on/off).
> >>>
> >>> Later, when the issue is better understood, we can add options to set
> >>> offsets, BIRs etc.
> >>>
> >>> The patch below replaces your "[PATCH v4 2/2] nvme: allow cmb and pmr to
> >>> be enabled" (but still requires "[PATCH v4 1/2] ...") and applies to
> >>> git://git.infradead.org/qemu-nvme.git nvme-next branch.
> >>>
> >>> Can you reproduce the issues with that patch? I can't on a stock Arch
> >>> Linux 5.7.5-arch1-1 kernel.
> >>
> >> While I'm happy that approach with MSIX and PBA in BAR0 works fine, I
> >> feel that investigation part why it works while mine doesn't is
> >> missing. It looks to me that both patches are basically following same 
> >> approach: create memory subregion and overlay on top of other memory
> >> region. Why one works and the other doesn't then?
> >>
> >> Having in mind that, I have recently focused on understanding problem.
> >> I observed that when guest assigns address to BAR4, addr field in
> >> nvme-bar4 memory region gets populated, but it doesn't get automatically
> >> populated in ctrl_mem (cmb) memory subregion, so later when 
> >> nvme_addr_is_cmb() 
> >> is called address check works incorrectly and as a consequence vmm does 
> >> dma 
> >> read instead of memcpy.
> >> I created a patch that sets correct address on ctrl_mem subregion and 
> >> guest 
> >> OS boots up correctly.
> >>
> >> When I looked into pci and memory region code I noticed that indeed address
> >> is only assigned to top level memory region but not to contained 
> >> subregions.
> >> I think that because in your approach cmb grabs whole bar exclusively it 
> >> works
> >> fine.
> >>
> >> Here is my question (perhaps pci folks can help answer :)): if we consider 
> >> memory region overlapping for pci devices as valid use case should pci 
> >> code on configuration cycles walk through all contained subregion and
> >> update addr field accordingly?
> >>
> >> Thx!
> >>
> > 
> > Hi Andrzej,
> > 
> > Thanks for looking into this. I think your analysis helped me nail this.
> > The problem is that we added the use of a subregion and have some
> > assumptions that no longer hold.
> > 
> > nvme_addr_is_cmb() assumes that n->ctrl_mem.addr is an absolute address.
> > But when the memory region is a subregion, addr holds an offset into the
> > parent container instead. Thus, changing all occurances of
> > n->ctrl_mem.addr to (n->bar0.addr + n->ctrl_mem.addr) fixes the issue
> > (this is required in nvme_addr_is_cmb and nvme_map_prp). I patched that
> > in your original patch[1]. The reason my version worked is because there
> > was no subregion involved for the CMB, so the existing address
> > validation calculations were still correct.
> 
> I'm a little bit concerned with this approach:
> (n->bar0.addr + n->ctrl_mem.addr) and hoping to have some debate. Let me 
> describe my understanding of the problem.

Oh. In the context of your patch I meant bar4 of course, but anyway.

> It looks to me that addr field sometimes contains *absolute* address (when no 
> hierarchy is used) and other times it contains *relative* address (when
> hierarchy is created). From my perspective use of this field is inconsistent
> and thus error-prone.  
> Because of that I think that doing n->bar0.addr + n->ctrl_mem.addr doesn't
> solve root problem and is still prone to the same problem if in the future
> we potentially build even more complicated hierarchy.
> I think that we could solve it by introducing helper function like
> 
> hwaddr memory_region_get_abs_addr(MemoryRegion *mr) 
> 
> to retrieve absolute address and in the documentation indicate that addr field
> can be relative or absolute and it is recommended to use above function to 
> retrieve absolute address.
> What do you think?
> 

I'm all for a helper - I was not gonna cheer for the quick'n'dirty fix I
did just to convince myself that this was the issue ;)

I think the helper might already be there in 

Re: [PATCH v4 2/2] nvme: allow cmb and pmr to be enabled on same device

2020-07-22 Thread Andrzej Jakowski
On 7/22/20 12:43 AM, Klaus Jensen wrote:
> @keith, please see below - can you comment on the Linux kernel 2 MB
> boundary requirement for the CMB? Or should we hail Stephen (or Logan
> maybe) since this seems to be related to p2pdma?
> 
> On Jul 21 14:54, Andrzej Jakowski wrote:
>> On 7/15/20 1:06 AM, Klaus Jensen wrote:
>>> Hi Andrzej,
>>>
>>> I've not been ignoring this, but sorry for not following up earlier.
>>>
>>> I'm hesitent to merge anything that very obviously breaks an OS that we
>>> know is used a lot to this using this device. Also because the issue has
>>> not been analyzed well enough to actually know if this is a QEMU or
>>> kernel issue.
>>
>> Hi Klaus,
>>
>> Thx for your response! I understand your hesitance on merging stuff that
>> obviously breaks guest OS. 
>>
>>>
>>> Now, as far as I can test, having the MSI-X vector table and PBA in BAR
>>> 0, PMR in BAR 2 and CMB in BAR 4 seems to make everyone happy
>>> (irregardless of IOMMU on/off).
>>>
>>> Later, when the issue is better understood, we can add options to set
>>> offsets, BIRs etc.
>>>
>>> The patch below replaces your "[PATCH v4 2/2] nvme: allow cmb and pmr to
>>> be enabled" (but still requires "[PATCH v4 1/2] ...") and applies to
>>> git://git.infradead.org/qemu-nvme.git nvme-next branch.
>>>
>>> Can you reproduce the issues with that patch? I can't on a stock Arch
>>> Linux 5.7.5-arch1-1 kernel.
>>
>> While I'm happy that approach with MSIX and PBA in BAR0 works fine, I
>> feel that investigation part why it works while mine doesn't is
>> missing. It looks to me that both patches are basically following same 
>> approach: create memory subregion and overlay on top of other memory
>> region. Why one works and the other doesn't then?
>>
>> Having in mind that, I have recently focused on understanding problem.
>> I observed that when guest assigns address to BAR4, addr field in
>> nvme-bar4 memory region gets populated, but it doesn't get automatically
>> populated in ctrl_mem (cmb) memory subregion, so later when 
>> nvme_addr_is_cmb() 
>> is called address check works incorrectly and as a consequence vmm does dma 
>> read instead of memcpy.
>> I created a patch that sets correct address on ctrl_mem subregion and guest 
>> OS boots up correctly.
>>
>> When I looked into pci and memory region code I noticed that indeed address
>> is only assigned to top level memory region but not to contained subregions.
>> I think that because in your approach cmb grabs whole bar exclusively it 
>> works
>> fine.
>>
>> Here is my question (perhaps pci folks can help answer :)): if we consider 
>> memory region overlapping for pci devices as valid use case should pci 
>> code on configuration cycles walk through all contained subregion and
>> update addr field accordingly?
>>
>> Thx!
>>
> 
> Hi Andrzej,
> 
> Thanks for looking into this. I think your analysis helped me nail this.
> The problem is that we added the use of a subregion and have some
> assumptions that no longer hold.
> 
> nvme_addr_is_cmb() assumes that n->ctrl_mem.addr is an absolute address.
> But when the memory region is a subregion, addr holds an offset into the
> parent container instead. Thus, changing all occurances of
> n->ctrl_mem.addr to (n->bar0.addr + n->ctrl_mem.addr) fixes the issue
> (this is required in nvme_addr_is_cmb and nvme_map_prp). I patched that
> in your original patch[1]. The reason my version worked is because there
> was no subregion involved for the CMB, so the existing address
> validation calculations were still correct.

I'm a little bit concerned with this approach,
(n->bar0.addr + n->ctrl_mem.addr), and hoping to have some debate. Let me
describe my understanding of the problem.
It looks to me that the addr field sometimes contains an *absolute* address
(when no hierarchy is used) and other times a *relative* address (when a
hierarchy is created). From my perspective the use of this field is
inconsistent and thus error-prone.
Because of that I think that doing n->bar0.addr + n->ctrl_mem.addr doesn't
solve the root problem and is still prone to the same issue if we ever build
an even more complicated hierarchy.
I think we could solve it by introducing a helper function like

hwaddr memory_region_get_abs_addr(MemoryRegion *mr)

to retrieve the absolute address, and indicate in the documentation that the
addr field can be relative or absolute and that it is recommended to use the
above function to retrieve the absolute address.
What do you think?
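For illustration, a minimal sketch of what such a helper could look like. This
is not a tested QEMU patch; it assumes the addr and container fields of
MemoryRegion from include/exec/memory.h, ignores alias regions, and direct
field access like this would normally live next to the rest of the memory API:

#include "qemu/osdep.h"
#include "exec/memory.h"

/*
 * Sketch only: walk up the container chain, adding each region's
 * (container-relative) addr, to obtain the absolute address of @mr.
 */
hwaddr memory_region_get_abs_addr(MemoryRegion *mr)
{
    hwaddr abs_addr = mr->addr;

    while (mr->container) {
        mr = mr->container;
        abs_addr += mr->addr;
    }

    return abs_addr;
}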

> 
> This leaves us with the Linux kernel complaining about not being able to
> register the CMB if it is not on a 2MB boundary - this is probably just
> a constraint in the kernel that we can't do anything about (but I'm no
> kernel hacker...), which can be fixed by either being "nice" towards the
> Linux kernel by forcing a 2 MB alignment in the device or exposing the
> SZU field such that the user can choose 16MiB size units (or higher)
> instead of 1MiB. I'm leaning towards ensuring the 2 MB alignment in the
> device such that we do not have to introduce new cmb_size parameters,
> while also making it easier for the user to configure. But I'm not
> really sure.

Re: [PATCH v4 2/2] nvme: allow cmb and pmr to be enabled on same device

2020-07-22 Thread Klaus Jensen
@keith, please see below - can you comment on the Linux kernel 2 MB
boundary requirement for the CMB? Or should we hail Stephen (or Logan
maybe) since this seems to be related to p2pdma?

On Jul 21 14:54, Andrzej Jakowski wrote:
> On 7/15/20 1:06 AM, Klaus Jensen wrote:
> > Hi Andrzej,
> > 
> > I've not been ignoring this, but sorry for not following up earlier.
> > 
> > I'm hesitent to merge anything that very obviously breaks an OS that we
> > know is used a lot to this using this device. Also because the issue has
> > not been analyzed well enough to actually know if this is a QEMU or
> > kernel issue.
> 
> Hi Klaus,
> 
> Thx for your response! I understand your hesitance on merging stuff that
> obviously breaks guest OS. 
> 
> > 
> > Now, as far as I can test, having the MSI-X vector table and PBA in BAR
> > 0, PMR in BAR 2 and CMB in BAR 4 seems to make everyone happy
> > (irregardless of IOMMU on/off).
> > 
> > Later, when the issue is better understood, we can add options to set
> > offsets, BIRs etc.
> > 
> > The patch below replaces your "[PATCH v4 2/2] nvme: allow cmb and pmr to
> > be enabled" (but still requires "[PATCH v4 1/2] ...") and applies to
> > git://git.infradead.org/qemu-nvme.git nvme-next branch.
> > 
> > Can you reproduce the issues with that patch? I can't on a stock Arch
> > Linux 5.7.5-arch1-1 kernel.
> 
> While I'm happy that approach with MSIX and PBA in BAR0 works fine, I
> feel that investigation part why it works while mine doesn't is
> missing. It looks to me that both patches are basically following same 
> approach: create memory subregion and overlay on top of other memory
> region. Why one works and the other doesn't then?
> 
> Having in mind that, I have recently focused on understanding problem.
> I observed that when guest assigns address to BAR4, addr field in
> nvme-bar4 memory region gets populated, but it doesn't get automatically
> populated in ctrl_mem (cmb) memory subregion, so later when 
> nvme_addr_is_cmb() 
> is called address check works incorrectly and as a consequence vmm does dma 
> read instead of memcpy.
> I created a patch that sets correct address on ctrl_mem subregion and guest 
> OS boots up correctly.
> 
> When I looked into pci and memory region code I noticed that indeed address
> is only assigned to top level memory region but not to contained subregions.
> I think that because in your approach cmb grabs whole bar exclusively it works
> fine.
> 
> Here is my question (perhaps pci folks can help answer :)): if we consider 
> memory region overlapping for pci devices as valid use case should pci 
> code on configuration cycles walk through all contained subregion and
> update addr field accordingly?
> 
> Thx!
> 

Hi Andrzej,

Thanks for looking into this. I think your analysis helped me nail this.
The problem is that we added the use of a subregion and have some
assumptions that no longer hold.

nvme_addr_is_cmb() assumes that n->ctrl_mem.addr is an absolute address.
But when the memory region is a subregion, addr holds an offset into the
parent container instead. Thus, changing all occurrences of
n->ctrl_mem.addr to (n->bar0.addr + n->ctrl_mem.addr) fixes the issue
(this is required in nvme_addr_is_cmb and nvme_map_prp). I patched that
into your original patch[1]. The reason my version worked is that there
was no subregion involved for the CMB, so the existing address
validation calculations were still correct.
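For reference, a rough sketch of the corrected check (not the exact hunk I
used): it assumes the CMB subregion n->ctrl_mem sits inside the BAR container
region (n->bar4 in the context of your v4 patch; I wrote bar0 above), and uses
memory_region_size() from the memory API:

/*
 * Sketch only: ctrl_mem is a subregion, so its .addr is an offset into
 * the containing BAR region; add the container's .addr to get the
 * guest-physical address that the host driver actually uses.
 */
static bool nvme_addr_is_cmb(NvmeCtrl *n, hwaddr addr)
{
    hwaddr lo = n->bar4.addr + n->ctrl_mem.addr;
    hwaddr hi = lo + memory_region_size(&n->ctrl_mem);

    return addr >= lo && addr < hi;
}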

This leaves us with the Linux kernel complaining about not being able to
register the CMB if it is not on a 2MB boundary - this is probably just
a constraint in the kernel that we can't do anything about (but I'm no
kernel hacker...). It can be fixed either by being "nice" towards the
Linux kernel and forcing a 2 MB alignment in the device, or by exposing the
SZU field such that the user can choose 16MiB size units (or higher)
instead of 1MiB. I'm leaning towards ensuring the 2 MB alignment in the
device such that we do not have to introduce new cmb_size parameters,
while also making it easier for the user to configure. But I'm not
really sure.

  [1]: Message-Id: <20200701214858.28515-3-andrzej.jakow...@linux.intel.com>
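For illustration only, a tiny standalone program (hypothetical numbers, not
QEMU code) showing why a 1 MiB-aligned CMB offset can miss a 2 MiB boundary
while a forced 2 MiB alignment cannot:

#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>

/* round x up to the next multiple of a (a must be a power of two) */
#define ALIGN_UP(x, a) (((x) + (a) - 1) & ~((uint64_t)(a) - 1))

int main(void)
{
    uint64_t msix_pba_end = 0x3000;  /* hypothetical end of MSI-X/PBA tables */
    uint64_t off_1m = ALIGN_UP(msix_pba_end, 1ULL << 20); /* 1 MiB SZU     */
    uint64_t off_2m = ALIGN_UP(msix_pba_end, 2ULL << 20); /* forced 2 MiB  */

    printf("1 MiB-aligned CMB offset: 0x%" PRIx64 ", on a 2 MiB boundary: %s\n",
           off_1m, off_1m % (2ULL << 20) ? "no" : "yes");
    printf("2 MiB-aligned CMB offset: 0x%" PRIx64 "\n", off_2m);
    return 0;
}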



Re: [PATCH v4 2/2] nvme: allow cmb and pmr to be enabled on same device

2020-07-21 Thread Andrzej Jakowski
On 7/15/20 1:06 AM, Klaus Jensen wrote:
> Hi Andrzej,
> 
> I've not been ignoring this, but sorry for not following up earlier.
> 
> I'm hesitent to merge anything that very obviously breaks an OS that we
> know is used a lot to this using this device. Also because the issue has
> not been analyzed well enough to actually know if this is a QEMU or
> kernel issue.

Hi Klaus,

Thx for your response! I understand your hesitance about merging stuff that
obviously breaks the guest OS.

> 
> Now, as far as I can test, having the MSI-X vector table and PBA in BAR
> 0, PMR in BAR 2 and CMB in BAR 4 seems to make everyone happy
> (irregardless of IOMMU on/off).
> 
> Later, when the issue is better understood, we can add options to set
> offsets, BIRs etc.
> 
> The patch below replaces your "[PATCH v4 2/2] nvme: allow cmb and pmr to
> be enabled" (but still requires "[PATCH v4 1/2] ...") and applies to
> git://git.infradead.org/qemu-nvme.git nvme-next branch.
> 
> Can you reproduce the issues with that patch? I can't on a stock Arch
> Linux 5.7.5-arch1-1 kernel.

While I'm happy that the approach with MSI-X and PBA in BAR0 works fine, I
feel that the investigation into why it works while mine doesn't is still
missing. It looks to me that both patches follow basically the same
approach: create a memory subregion and overlay it on top of another memory
region. Why does one work while the other doesn't?

With that in mind, I have recently focused on understanding the problem.
I observed that when the guest assigns an address to BAR4, the addr field in
the nvme-bar4 memory region gets populated, but it doesn't get automatically
populated in the ctrl_mem (CMB) memory subregion. So later, when
nvme_addr_is_cmb() is called, the address check works incorrectly and, as a
consequence, the VMM does a DMA read instead of a memcpy.
I created a patch that sets the correct address on the ctrl_mem subregion and
the guest OS boots up correctly.
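For illustration, a simplified sketch of the read dispatch described above
(names roughly follow hw/block/nvme.c from this period, not the exact code):
note that both the check and the offset arithmetic assume ctrl_mem.addr is an
absolute address, which is exactly what stops holding once the CMB becomes a
subregion:

static void nvme_addr_read(NvmeCtrl *n, hwaddr addr, void *buf, int size)
{
    if (n->bar.cmbsz && nvme_addr_is_cmb(n, addr)) {
        /* address believed to be inside the CMB: plain memcpy */
        memcpy(buf, (void *)&n->cmbuf[addr - n->ctrl_mem.addr], size);
        return;
    }

    /* otherwise a DMA read, which goes through the (v)IOMMU */
    pci_dma_read(&n->parent_obj, addr, buf, size);
}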

When I looked into the pci and memory region code I noticed that indeed the
address is only assigned to the top-level memory region, not to the contained
subregions. I think that because in your approach the CMB grabs the whole BAR
exclusively, it works fine.

Here is my question (perhaps the pci folks can help answer :)): if we consider
memory region overlapping for pci devices a valid use case, should the pci
code, on configuration cycles, walk through all contained subregions and
update their addr fields accordingly?

Thx!



Re: [PATCH v4 2/2] nvme: allow cmb and pmr to be enabled on same device

2020-07-15 Thread Klaus Jensen
On Jul 15 10:06, Klaus Jensen wrote:
> 
> Can you reproduce the issues with that patch? I can't on a stock Arch
> Linux 5.7.5-arch1-1 kernel.
> 

I was afraid I said this a bit too early, since the stock kernel does not
enable the IOMMU by default. Tested with another kernel and yes, I still
cannot reproduce the vtd errors or any complaints from the kernel (w/
or w/o iommu).



Re: [PATCH v4 2/2] nvme: allow cmb and pmr to be enabled on same device

2020-07-15 Thread Klaus Jensen
On Jul  7 21:44, Andrzej Jakowski wrote:
> On 7/6/20 12:15 AM, Klaus Jensen wrote:
> > On Jul  2 16:33, Andrzej Jakowski wrote:
> >> On 7/2/20 10:51 AM, Klaus Jensen wrote:
> >>> On Jul  2 08:07, Andrzej Jakowski wrote:
>  On 7/2/20 3:31 AM, Klaus Jensen wrote:
> > Aight, an update here. This only happens when QEMU is run with a virtual
> > IOMMU. Otherwise, the kernel is happy.
> >
> > With the vIOMMU, qemu also craps out a bit:
> >
> > qemu-system-x86_64: vtd_iova_to_slpte: detected slpte permission error 
> > (iova=0xfd20, level=0x2, slpte=0x0, write=0)
> > qemu-system-x86_64: vtd_iommu_translate: detected translation failure 
> > (dev=03:00:00, iova=0xfd20)
> >
> > So I think we are back in QEMU land for the bug.
> 
>  Can you share command line for that?
> 
> 
> >>>
> >>> qemu-system-x86_64 \
> >>>   -nodefaults \
> >>>   -display none \
> >>>   -device intel-iommu,pt,intremap=on,device-iotlb=on \
> >>>   -machine type=q35,accel=kvm,kernel_irqchip=split \
> >>>   -cpu host \
> >>>   -smp 4 \
> >>>   -m 8G \
> >>>   -nic user,model=virtio-net-pci,hostfwd=tcp::-:22 \
> >>>   -device virtio-rng-pci \
> >>>   -drive 
> >>> id=boot,file=/home/kbj/work/src/vmctl/state/pmr/boot.qcow2,format=qcow2,if=virtio,discard=on,detect-zeroes=unmap
> >>>  \
> >>>   -device pcie-root-port,id=pcie_root_port1,chassis=1,slot=0 \
> >>>   -device x3130-upstream,id=pcie_upstream1,bus=pcie_root_port1 \
> >>>   -device 
> >>> xio3130-downstream,id=pcie_downstream1,bus=pcie_upstream1,chassis=1,slot=1
> >>>  \
> >>>   -drive 
> >>> id=nvme0n1,file=/home/kbj/work/src/vmctl/state/pmr/nvme0n1.img,format=raw,if=none,discard=on,detect-zeroes=unmap
> >>>  \
> >>>   -object memory-backend-file,id=pmr,share=on,mem-path=pmr.bin,size=1M \
> >>>   -device 
> >>> nvme,id=nvme0,serial=deadbeef,bus=pcie_downstream1,drive=nvme0n1,msix_qsize=1,pmrdev=pmr,cmb_size_mb=2
> >>>  \
> >>>   -pidfile /home/kbj/work/src/vmctl/run/pmr/pidfile \
> >>>   -kernel /home/kbj/work/src/kernel/linux/arch/x86_64/boot/bzImage \
> >>>   -append root=/dev/vda1 console=ttyS0,115200 audit=0 nokaslr \
> >>>   -virtfs 
> >>> local,path=/home/kbj/work/src/kernel/linux,security_model=none,readonly,mount_tag=modules
> >>>  \
> >>>   -serial mon:stdio \
> >>>   -trace pci_nvme*
> >>>
> >>>
> >>
> >> I focused on reproduction and it looks to me that my patch doesn't 
> >> necessarily introduce regression. I run it w/ and w/o patch in both cases
> >> getting error while registering. Here is kernel guest log:
> >>
> >> [   87.606482] nvme nvme0: pci function :00:04.0
> >> [   87.635577] dev=95b0a83b bar=2 size=134217728 offset=0
> >> [   87.636593] nvme nvme0: failed to register the CMB ret=-95
> >> [   87.643262] nvme nvme0: 12/0/0 default/read/poll queues
> >>
> >> Any thoughts?
> >>
> > 
> > Hmm, that's not what I am seeing.
> > 
> > With kernel v5.8-rc4, I'm not seeing any issues with CMB with and
> > without IOMMU on QEMU master. With your patches, my kernel (v5.8-rc4)
> > pukes both with and without iommu.
> > 
> > BUT! This doesn't mean that your patch is bad, it looks more like an
> > issue in the kernel. I still think the BAR configuration looks sane, but
> > I am not expert on this.
> > 
> > To satisify my curiosity I tried mending your patch to put the CMB on
> > offset 0 and move the MSI-X vector table and PBA to BAR 0 (like I
> > suggested back in the day). That works. With and without IOMMU. So, I
> > think it is an issue with the Linux kernel not being too happy about the
> > CMB being at an offset. It doesn't directly look like an issue in the
> > nvme driver since the issue shows up far lower in the memory subsystem,
> > but it would be nice to have the linux nvme gang at least acknowledge
> > the issue.
> > 
> 
> I have managed to reproduce that problem and played with patch to see
> when the problem occurs vs not. 
> When I put MSIX back to BAR2 (no PMR at all) and CMB left at BAR4 but 
> starting at offset 0 I was still able to reproduce issue.
> So then I've played with memory region API and interesting observed that
> problem occurs when region overlaying is used via:
> 
> memory_region_init(&n->bar4, OBJECT(n), "nvme-bar4", bar_size);
>
> if (n->params.cmb_size_mb) {
>     memory_region_init_io(&n->ctrl_mem, OBJECT(n), &nvme_cmb_ops, n,
>                           "nvme-cmb", NVME_CMBSZ_GETSIZE(n->bar.cmbsz));
>
>     memory_region_add_subregion_overlap(&n->bar4, cmb_offset, &n->ctrl_mem, 1);
> }
> 
> on the other hand when cmb memory region is initialized w/o region
> overlaying that is:
> 
> memory_region_init_io(&n->ctrl_mem, OBJECT(n), &nvme_cmb_ops, n,
>                       "nvme-cmb", NVME_CMBSZ_GETSIZE(n->bar.cmbsz));
> 
> I get no reproduction.
> 
> Also observed qemu complaing about failed translation:
> qemu-system-x86_64: vtd_iova_to_slpte: detected slpte permission error 
> (iova=0xfe40, level=0x2, slpte=0x0, write=0)
> qemu-system-x86_64: vtd_iommu_translate: detected translation failure 
> (dev=03:00:00, iova=0xfe40)

Re: [PATCH v4 2/2] nvme: allow cmb and pmr to be enabled on same device

2020-07-07 Thread Andrzej Jakowski
On 7/6/20 12:15 AM, Klaus Jensen wrote:
> On Jul  2 16:33, Andrzej Jakowski wrote:
>> On 7/2/20 10:51 AM, Klaus Jensen wrote:
>>> On Jul  2 08:07, Andrzej Jakowski wrote:
 On 7/2/20 3:31 AM, Klaus Jensen wrote:
> Aight, an update here. This only happens when QEMU is run with a virtual
> IOMMU. Otherwise, the kernel is happy.
>
> With the vIOMMU, qemu also craps out a bit:
>
> qemu-system-x86_64: vtd_iova_to_slpte: detected slpte permission error 
> (iova=0xfd20, level=0x2, slpte=0x0, write=0)
> qemu-system-x86_64: vtd_iommu_translate: detected translation failure 
> (dev=03:00:00, iova=0xfd20)
>
> So I think we are back in QEMU land for the bug.

 Can you share command line for that?


>>>
>>> qemu-system-x86_64 \
>>>   -nodefaults \
>>>   -display none \
>>>   -device intel-iommu,pt,intremap=on,device-iotlb=on \
>>>   -machine type=q35,accel=kvm,kernel_irqchip=split \
>>>   -cpu host \
>>>   -smp 4 \
>>>   -m 8G \
>>>   -nic user,model=virtio-net-pci,hostfwd=tcp::-:22 \
>>>   -device virtio-rng-pci \
>>>   -drive 
>>> id=boot,file=/home/kbj/work/src/vmctl/state/pmr/boot.qcow2,format=qcow2,if=virtio,discard=on,detect-zeroes=unmap
>>>  \
>>>   -device pcie-root-port,id=pcie_root_port1,chassis=1,slot=0 \
>>>   -device x3130-upstream,id=pcie_upstream1,bus=pcie_root_port1 \
>>>   -device 
>>> xio3130-downstream,id=pcie_downstream1,bus=pcie_upstream1,chassis=1,slot=1 \
>>>   -drive 
>>> id=nvme0n1,file=/home/kbj/work/src/vmctl/state/pmr/nvme0n1.img,format=raw,if=none,discard=on,detect-zeroes=unmap
>>>  \
>>>   -object memory-backend-file,id=pmr,share=on,mem-path=pmr.bin,size=1M \
>>>   -device 
>>> nvme,id=nvme0,serial=deadbeef,bus=pcie_downstream1,drive=nvme0n1,msix_qsize=1,pmrdev=pmr,cmb_size_mb=2
>>>  \
>>>   -pidfile /home/kbj/work/src/vmctl/run/pmr/pidfile \
>>>   -kernel /home/kbj/work/src/kernel/linux/arch/x86_64/boot/bzImage \
>>>   -append root=/dev/vda1 console=ttyS0,115200 audit=0 nokaslr \
>>>   -virtfs 
>>> local,path=/home/kbj/work/src/kernel/linux,security_model=none,readonly,mount_tag=modules
>>>  \
>>>   -serial mon:stdio \
>>>   -trace pci_nvme*
>>>
>>>
>>
>> I focused on reproduction and it looks to me that my patch doesn't 
>> necessarily introduce regression. I run it w/ and w/o patch in both cases
>> getting error while registering. Here is kernel guest log:
>>
>> [   87.606482] nvme nvme0: pci function :00:04.0
>> [   87.635577] dev=95b0a83b bar=2 size=134217728 offset=0
>> [   87.636593] nvme nvme0: failed to register the CMB ret=-95
>> [   87.643262] nvme nvme0: 12/0/0 default/read/poll queues
>>
>> Any thoughts?
>>
> 
> Hmm, that's not what I am seeing.
> 
> With kernel v5.8-rc4, I'm not seeing any issues with CMB with and
> without IOMMU on QEMU master. With your patches, my kernel (v5.8-rc4)
> pukes both with and without iommu.
> 
> BUT! This doesn't mean that your patch is bad, it looks more like an
> issue in the kernel. I still think the BAR configuration looks sane, but
> I am not expert on this.
> 
> To satisify my curiosity I tried mending your patch to put the CMB on
> offset 0 and move the MSI-X vector table and PBA to BAR 0 (like I
> suggested back in the day). That works. With and without IOMMU. So, I
> think it is an issue with the Linux kernel not being too happy about the
> CMB being at an offset. It doesn't directly look like an issue in the
> nvme driver since the issue shows up far lower in the memory subsystem,
> but it would be nice to have the linux nvme gang at least acknowledge
> the issue.
> 

I have managed to reproduce that problem and played with the patch to see
when the problem occurs and when it doesn't.
When I put MSI-X back in BAR2 (no PMR at all) and left the CMB in BAR4 but
starting at offset 0, I was still able to reproduce the issue.
So then I played with the memory region API and, interestingly, observed that
the problem occurs when region overlaying is used via:

memory_region_init(&n->bar4, OBJECT(n), "nvme-bar4", bar_size);

if (n->params.cmb_size_mb) {
    memory_region_init_io(&n->ctrl_mem, OBJECT(n), &nvme_cmb_ops, n,
                          "nvme-cmb", NVME_CMBSZ_GETSIZE(n->bar.cmbsz));

    memory_region_add_subregion_overlap(&n->bar4, cmb_offset, &n->ctrl_mem, 1);
}

on the other hand, when the cmb memory region is initialized w/o region
overlaying, that is:

memory_region_init_io(&n->ctrl_mem, OBJECT(n), &nvme_cmb_ops, n,
                      "nvme-cmb", NVME_CMBSZ_GETSIZE(n->bar.cmbsz));

I get no reproduction.

Also observed qemu complaining about a failed translation:
qemu-system-x86_64: vtd_iova_to_slpte: detected slpte permission error 
(iova=0xfe40, level=0x2, slpte=0x0, write=0)
qemu-system-x86_64: vtd_iommu_translate: detected translation failure 
(dev=03:00:00, iova=0xfe40)

Not sure how we want to proceed. Any suggestions?



Re: [PATCH v4 2/2] nvme: allow cmb and pmr to be enabled on same device

2020-07-06 Thread Klaus Jensen
On Jul  2 16:33, Andrzej Jakowski wrote:
> On 7/2/20 10:51 AM, Klaus Jensen wrote:
> > On Jul  2 08:07, Andrzej Jakowski wrote:
> >> On 7/2/20 3:31 AM, Klaus Jensen wrote:
> >>> Aight, an update here. This only happens when QEMU is run with a virtual
> >>> IOMMU. Otherwise, the kernel is happy.
> >>>
> >>> With the vIOMMU, qemu also craps out a bit:
> >>>
> >>> qemu-system-x86_64: vtd_iova_to_slpte: detected slpte permission error 
> >>> (iova=0xfd20, level=0x2, slpte=0x0, write=0)
> >>> qemu-system-x86_64: vtd_iommu_translate: detected translation failure 
> >>> (dev=03:00:00, iova=0xfd20)
> >>>
> >>> So I think we are back in QEMU land for the bug.
> >>
> >> Can you share command line for that?
> >>
> >>
> > 
> > qemu-system-x86_64 \
> >   -nodefaults \
> >   -display none \
> >   -device intel-iommu,pt,intremap=on,device-iotlb=on \
> >   -machine type=q35,accel=kvm,kernel_irqchip=split \
> >   -cpu host \
> >   -smp 4 \
> >   -m 8G \
> >   -nic user,model=virtio-net-pci,hostfwd=tcp::-:22 \
> >   -device virtio-rng-pci \
> >   -drive 
> > id=boot,file=/home/kbj/work/src/vmctl/state/pmr/boot.qcow2,format=qcow2,if=virtio,discard=on,detect-zeroes=unmap
> >  \
> >   -device pcie-root-port,id=pcie_root_port1,chassis=1,slot=0 \
> >   -device x3130-upstream,id=pcie_upstream1,bus=pcie_root_port1 \
> >   -device 
> > xio3130-downstream,id=pcie_downstream1,bus=pcie_upstream1,chassis=1,slot=1 \
> >   -drive 
> > id=nvme0n1,file=/home/kbj/work/src/vmctl/state/pmr/nvme0n1.img,format=raw,if=none,discard=on,detect-zeroes=unmap
> >  \
> >   -object memory-backend-file,id=pmr,share=on,mem-path=pmr.bin,size=1M \
> >   -device 
> > nvme,id=nvme0,serial=deadbeef,bus=pcie_downstream1,drive=nvme0n1,msix_qsize=1,pmrdev=pmr,cmb_size_mb=2
> >  \
> >   -pidfile /home/kbj/work/src/vmctl/run/pmr/pidfile \
> >   -kernel /home/kbj/work/src/kernel/linux/arch/x86_64/boot/bzImage \
> >   -append root=/dev/vda1 console=ttyS0,115200 audit=0 nokaslr \
> >   -virtfs 
> > local,path=/home/kbj/work/src/kernel/linux,security_model=none,readonly,mount_tag=modules
> >  \
> >   -serial mon:stdio \
> >   -trace pci_nvme*
> > 
> > 
> 
> I focused on reproduction and it looks to me that my patch doesn't 
> necessarily introduce regression. I run it w/ and w/o patch in both cases
> getting error while registering. Here is kernel guest log:
> 
> [   87.606482] nvme nvme0: pci function :00:04.0
> [   87.635577] dev=95b0a83b bar=2 size=134217728 offset=0
> [   87.636593] nvme nvme0: failed to register the CMB ret=-95
> [   87.643262] nvme nvme0: 12/0/0 default/read/poll queues
> 
> Any thoughts?
> 

Hmm, that's not what I am seeing.

With kernel v5.8-rc4, I'm not seeing any issues with CMB with and
without IOMMU on QEMU master. With your patches, my kernel (v5.8-rc4)
pukes both with and without iommu.

BUT! This doesn't mean that your patch is bad, it looks more like an
issue in the kernel. I still think the BAR configuration looks sane, but
I am no expert on this.

To satisfy my curiosity I tried mending your patch to put the CMB on
offset 0 and move the MSI-X vector table and PBA to BAR 0 (like I
suggested back in the day). That works. With and without IOMMU. So, I
think it is an issue with the Linux kernel not being too happy about the
CMB being at an offset. It doesn't directly look like an issue in the
nvme driver since the issue shows up far lower in the memory subsystem,
but it would be nice to have the linux nvme gang at least acknowledge
the issue.



Re: [PATCH v4 2/2] nvme: allow cmb and pmr to be enabled on same device

2020-07-02 Thread Andrzej Jakowski
On 7/2/20 10:51 AM, Klaus Jensen wrote:
> On Jul  2 08:07, Andrzej Jakowski wrote:
>> On 7/2/20 3:31 AM, Klaus Jensen wrote:
>>> Aight, an update here. This only happens when QEMU is run with a virtual
>>> IOMMU. Otherwise, the kernel is happy.
>>>
>>> With the vIOMMU, qemu also craps out a bit:
>>>
>>> qemu-system-x86_64: vtd_iova_to_slpte: detected slpte permission error 
>>> (iova=0xfd20, level=0x2, slpte=0x0, write=0)
>>> qemu-system-x86_64: vtd_iommu_translate: detected translation failure 
>>> (dev=03:00:00, iova=0xfd20)
>>>
>>> So I think we are back in QEMU land for the bug.
>>
>> Can you share command line for that?
>>
>>
> 
> qemu-system-x86_64 \
>   -nodefaults \
>   -display none \
>   -device intel-iommu,pt,intremap=on,device-iotlb=on \
>   -machine type=q35,accel=kvm,kernel_irqchip=split \
>   -cpu host \
>   -smp 4 \
>   -m 8G \
>   -nic user,model=virtio-net-pci,hostfwd=tcp::-:22 \
>   -device virtio-rng-pci \
>   -drive 
> id=boot,file=/home/kbj/work/src/vmctl/state/pmr/boot.qcow2,format=qcow2,if=virtio,discard=on,detect-zeroes=unmap
>  \
>   -device pcie-root-port,id=pcie_root_port1,chassis=1,slot=0 \
>   -device x3130-upstream,id=pcie_upstream1,bus=pcie_root_port1 \
>   -device 
> xio3130-downstream,id=pcie_downstream1,bus=pcie_upstream1,chassis=1,slot=1 \
>   -drive 
> id=nvme0n1,file=/home/kbj/work/src/vmctl/state/pmr/nvme0n1.img,format=raw,if=none,discard=on,detect-zeroes=unmap
>  \
>   -object memory-backend-file,id=pmr,share=on,mem-path=pmr.bin,size=1M \
>   -device 
> nvme,id=nvme0,serial=deadbeef,bus=pcie_downstream1,drive=nvme0n1,msix_qsize=1,pmrdev=pmr,cmb_size_mb=2
>  \
>   -pidfile /home/kbj/work/src/vmctl/run/pmr/pidfile \
>   -kernel /home/kbj/work/src/kernel/linux/arch/x86_64/boot/bzImage \
>   -append root=/dev/vda1 console=ttyS0,115200 audit=0 nokaslr \
>   -virtfs 
> local,path=/home/kbj/work/src/kernel/linux,security_model=none,readonly,mount_tag=modules
>  \
>   -serial mon:stdio \
>   -trace pci_nvme*
> 
> 

I focused on reproduction and it looks to me that my patch doesn't
necessarily introduce a regression. I ran it w/ and w/o the patch, in both
cases getting an error while registering the CMB. Here is the guest kernel log:

[   87.606482] nvme nvme0: pci function :00:04.0
[   87.635577] dev=95b0a83b bar=2 size=134217728 offset=0
[   87.636593] nvme nvme0: failed to register the CMB ret=-95
[   87.643262] nvme nvme0: 12/0/0 default/read/poll queues

Any thoughts?



Re: [PATCH v4 2/2] nvme: allow cmb and pmr to be enabled on same device

2020-07-02 Thread Klaus Jensen
On Jul  2 08:07, Andrzej Jakowski wrote:
> On 7/2/20 3:31 AM, Klaus Jensen wrote:
> > Aight, an update here. This only happens when QEMU is run with a virtual
> > IOMMU. Otherwise, the kernel is happy.
> > 
> > With the vIOMMU, qemu also craps out a bit:
> > 
> > qemu-system-x86_64: vtd_iova_to_slpte: detected slpte permission error 
> > (iova=0xfd20, level=0x2, slpte=0x0, write=0)
> > qemu-system-x86_64: vtd_iommu_translate: detected translation failure 
> > (dev=03:00:00, iova=0xfd20)
> > 
> > So I think we are back in QEMU land for the bug.
> 
> Can you share command line for that?
> 
> 

qemu-system-x86_64 \
  -nodefaults \
  -display none \
  -device intel-iommu,pt,intremap=on,device-iotlb=on \
  -machine type=q35,accel=kvm,kernel_irqchip=split \
  -cpu host \
  -smp 4 \
  -m 8G \
  -nic user,model=virtio-net-pci,hostfwd=tcp::-:22 \
  -device virtio-rng-pci \
  -drive 
id=boot,file=/home/kbj/work/src/vmctl/state/pmr/boot.qcow2,format=qcow2,if=virtio,discard=on,detect-zeroes=unmap
 \
  -device pcie-root-port,id=pcie_root_port1,chassis=1,slot=0 \
  -device x3130-upstream,id=pcie_upstream1,bus=pcie_root_port1 \
  -device 
xio3130-downstream,id=pcie_downstream1,bus=pcie_upstream1,chassis=1,slot=1 \
  -drive 
id=nvme0n1,file=/home/kbj/work/src/vmctl/state/pmr/nvme0n1.img,format=raw,if=none,discard=on,detect-zeroes=unmap
 \
  -object memory-backend-file,id=pmr,share=on,mem-path=pmr.bin,size=1M \
  -device 
nvme,id=nvme0,serial=deadbeef,bus=pcie_downstream1,drive=nvme0n1,msix_qsize=1,pmrdev=pmr,cmb_size_mb=2
 \
  -pidfile /home/kbj/work/src/vmctl/run/pmr/pidfile \
  -kernel /home/kbj/work/src/kernel/linux/arch/x86_64/boot/bzImage \
  -append root=/dev/vda1 console=ttyS0,115200 audit=0 nokaslr \
  -virtfs 
local,path=/home/kbj/work/src/kernel/linux,security_model=none,readonly,mount_tag=modules
 \
  -serial mon:stdio \
  -trace pci_nvme*





Re: [PATCH v4 2/2] nvme: allow cmb and pmr to be enabled on same device

2020-07-02 Thread Andrzej Jakowski
On 7/2/20 3:31 AM, Klaus Jensen wrote:
> Aight, an update here. This only happens when QEMU is run with a virtual
> IOMMU. Otherwise, the kernel is happy.
> 
> With the vIOMMU, qemu also craps out a bit:
> 
> qemu-system-x86_64: vtd_iova_to_slpte: detected slpte permission error 
> (iova=0xfd20, level=0x2, slpte=0x0, write=0)
> qemu-system-x86_64: vtd_iommu_translate: detected translation failure 
> (dev=03:00:00, iova=0xfd20)
> 
> So I think we are back in QEMU land for the bug.

Can you share command line for that?




Re: [PATCH v4 2/2] nvme: allow cmb and pmr to be enabled on same device

2020-07-02 Thread Klaus Jensen
On Jul  2 12:13, Klaus Jensen wrote:
> On Jul  1 14:48, Andrzej Jakowski wrote:
> > So far it was not possible to have CMB and PMR emulated on the same
> > device, because BAR2 was used exclusively either of PMR or CMB. This
> > patch places CMB at BAR4 offset so it not conflicts with MSI-X vectors.
> > 
> 
> Linux craps out when I test this (1MB CMB):
> 
> Misaligned __add_pages start: 0xfdd00 end: #fdeff
> 
> I tracked it down to check_pfn_span (mm/memory_hotplug.c) failing
> because it's not on a 2MB boundary. I then tried to monkey patch the
> cmb_offset to be 2MB instead and it succeeds in registering the cmb:
> 
> [8.384565] memmap_init_zone_device initialised 512 pages in 0ms
> [8.385715] nvme :03:00.0: added peer-to-peer DMA memory [mem 
> 0xfd20-0xfd3f 64bit
> pref]
> [8.419372] nvme nvme0: 1/0/0 default/read/poll queues
> 
> But the kernel then continues to really crap out with a kernel panic:
> 
> [8.440891] DMAR: DRHD: handling fault status reg 2
> [8.440934] BUG: kernel NULL pointer dereference, address: 0120
> [8.441713] DMAR: [DMA Read] Request device [03:00.0] PASID  fault 
> addr fd20 [faul
> t reason 06] PTE Read access is not set
> [8.442630] #PF: supervisor write access in kernel mode
> [8.444972] #PF: error_code(0x0002) - not-present page
> [8.445640] PGD 0 P4D 0
> [8.445965] Oops: 0002 [#1] PREEMPT SMP PTI
> [8.446499] CPU: 3 PID: 0 Comm: swapper/3 Not tainted 
> 5.8.0-rc1-00034-gb6cf9836d07f-dirty #19
> [8.447547] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 
> rel-1.13.0-0-gf21b5a4aeb0
> 2-prebuilt.qemu.org 04/01/2014
> [8.448898] RIP: 0010:nvme_process_cq+0xc4/0x200 [nvme]
> [8.449525] Code: cf 00 00 00 48 8b 57 70 48 8b 7c c2 f8 e8 14 e9 32 c1 49 
> 89 c7 0f 1f 44 00 0
> 0 41 0f b7 44 24 0e 49 8b 14 24 4c 89 ff 66 d1 e8 <49> 89 97 20 01 00 00 66 
> 41 89 87 2a 01 00 00
> e8 28 04 33 c1 0f b7
> [8.451662] RSP: 0018:c915cf20 EFLAGS: 00010803
> [8.452321] RAX: 400b RBX: 88826faad0c0 RCX: 
> 
> [8.453293] RDX:  RSI:  RDI: 
> 
> [8.454312] RBP: 8882725d38e4 R08: 0001f71e225c R09: 
> 
> [8.455319] R10:  R11:  R12: 
> 888270bb
> [8.456334] R13:  R14: c915cfac R15: 
> 
> [8.457311] FS:  () GS:888277d8() 
> knlGS:
> [8.458441] CS:  0010 DS:  ES:  CR0: 80050033
> [8.459380] CR2: 0120 CR3: 000271c8c006 CR4: 
> 00360ee0
> [8.460507] Call Trace:
> [8.460906]  
> [8.461272]  nvme_irq+0x10/0x20 [nvme]
> [8.461951]  __handle_irq_event_percpu+0x45/0x1b0
> [8.462803]  ? handle_fasteoi_irq+0x210/0x210
> [8.463585]  handle_irq_event+0x58/0xb0
> [8.464312]  handle_edge_irq+0xae/0x270
> [8.465027]  asm_call_on_stack+0x12/0x20
> [8.465686]  
> [8.466053]  common_interrupt+0x120/0x1f0
> [8.466717]  asm_common_interrupt+0x1e/0x40
> [8.467429] RIP: 0010:default_idle+0x21/0x170
> [8.468140] Code: eb a6 e8 82 27 ff ff cc cc 0f 1f 44 00 00 41 54 55 53 e8 
> e2 2e ff ff 0f 1f 4
> 4 00 00 e9 07 00 00 00 0f 00 2d f3 d4 44 00 fb f4  ca 2e ff ff 89 c5 0f 
> 1f 44 00 00 5b 5d 41
> 5c c3 89 c5 65 8b 05
> [8.471286] RSP: 0018:c909fec8 EFLAGS: 0282
> [8.472202] RAX: 0003 RBX: 888276ff RCX: 
> 0001
> [8.473405] RDX: 0001 RSI: 8212355f RDI: 
> 8212d699
> [8.474571] RBP: 0003 R08: 888277d9e4a0 R09: 
> 0020
> [8.475717] R10:  R11:  R12: 
> 
> [8.476921] R13:  R14:  R15: 
> 
> [8.478110]  ? default_idle+0xe/0x170
> [8.478728]  do_idle+0x1e1/0x240
> [8.479283]  ? _raw_spin_lock_irqsave+0x19/0x40
> [8.480040]  cpu_startup_entry+0x19/0x20
> [8.480705]  start_secondary+0x153/0x190
> [8.481400]  secondary_startup_64+0xb6/0xc0
> [8.482114] Modules linked in: libata nvme nvme_core scsi_mod t10_pi 
> crc_t10dif crct10dif_gene
> ric crct10dif_common virtio_net net_failover failover virtio_rng rng_core
> [8.484675] CR2: 0120
> [8.485264] ---[ end trace ff1849437c76af12 ]---
> [8.486065] RIP: 0010:nvme_process_cq+0xc4/0x200 [nvme]
> [8.486953] Code: cf 00 00 00 48 8b 57 70 48 8b 7c c2 f8 e8 14 e9 32 c1 49 
> 89 c7 0f 1f 44 00 0
> 0 41 0f b7 44 24 0e 49 8b 14 24 4c 89 ff 66 d1 e8 <49> 89 97 20 01 00 00 66 
> 41 89 87 2a 01 00 00
> e8 28 04 33 c1 0f b7
> [8.490234] RSP: 0018:c915cf20 EFLAGS: 00010803
> [8.491144] RAX: 400b RBX: 88826faad0c0 RCX: 
> 
> [8.492445] RDX:  RSI:  RDI: 
> 
> [8.493681] 

Re: [PATCH v4 2/2] nvme: allow cmb and pmr to be enabled on same device

2020-07-02 Thread Klaus Jensen
On Jul  1 14:48, Andrzej Jakowski wrote:
> So far it was not possible to have CMB and PMR emulated on the same
> device, because BAR2 was used exclusively either of PMR or CMB. This
> patch places CMB at BAR4 offset so it not conflicts with MSI-X vectors.
> 

Linux craps out when I test this (1MB CMB):

Misaligned __add_pages start: 0xfdd00 end: #fdeff

I tracked it down to check_pfn_span (mm/memory_hotplug.c) failing
because it's not on a 2MB boundary. I then tried to monkey patch the
cmb_offset to be 2MB instead and it succeeds in registering the cmb:

[8.384565] memmap_init_zone_device initialised 512 pages in 0ms
[8.385715] nvme :03:00.0: added peer-to-peer DMA memory [mem 
0xfd20-0xfd3f 64bit
pref]
[8.419372] nvme nvme0: 1/0/0 default/read/poll queues
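For reference, the "Misaligned __add_pages" message further up is plain
arithmetic on the start pfn; a rough standalone check, assuming 4 KiB pages
and the 2 MiB hotplug subsection granularity discussed here:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint64_t start_pfn = 0xfdd00;                  /* from the log above   */
    uint64_t pages_per_2mb = (2ULL << 20) / 4096;  /* 512 pages per 2 MiB  */

    printf("0x%llx %% %llu = %llu -> %s\n",
           (unsigned long long)start_pfn,
           (unsigned long long)pages_per_2mb,
           (unsigned long long)(start_pfn % pages_per_2mb),
           start_pfn % pages_per_2mb ? "misaligned" : "aligned");
    return 0;
}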

But the kernel then continues to really crap out with a kernel panic:

[8.440891] DMAR: DRHD: handling fault status reg 2
[8.440934] BUG: kernel NULL pointer dereference, address: 0120
[8.441713] DMAR: [DMA Read] Request device [03:00.0] PASID  fault 
addr fd20 [faul
t reason 06] PTE Read access is not set
[8.442630] #PF: supervisor write access in kernel mode
[8.444972] #PF: error_code(0x0002) - not-present page
[8.445640] PGD 0 P4D 0
[8.445965] Oops: 0002 [#1] PREEMPT SMP PTI
[8.446499] CPU: 3 PID: 0 Comm: swapper/3 Not tainted 
5.8.0-rc1-00034-gb6cf9836d07f-dirty #19
[8.447547] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 
rel-1.13.0-0-gf21b5a4aeb0
2-prebuilt.qemu.org 04/01/2014
[8.448898] RIP: 0010:nvme_process_cq+0xc4/0x200 [nvme]
[8.449525] Code: cf 00 00 00 48 8b 57 70 48 8b 7c c2 f8 e8 14 e9 32 c1 49 
89 c7 0f 1f 44 00 0
0 41 0f b7 44 24 0e 49 8b 14 24 4c 89 ff 66 d1 e8 <49> 89 97 20 01 00 00 66 41 
89 87 2a 01 00 00
e8 28 04 33 c1 0f b7
[8.451662] RSP: 0018:c915cf20 EFLAGS: 00010803
[8.452321] RAX: 400b RBX: 88826faad0c0 RCX: 
[8.453293] RDX:  RSI:  RDI: 
[8.454312] RBP: 8882725d38e4 R08: 0001f71e225c R09: 
[8.455319] R10:  R11:  R12: 888270bb
[8.456334] R13:  R14: c915cfac R15: 
[8.457311] FS:  () GS:888277d8() 
knlGS:
[8.458441] CS:  0010 DS:  ES:  CR0: 80050033
[8.459380] CR2: 0120 CR3: 000271c8c006 CR4: 00360ee0
[8.460507] Call Trace:
[8.460906]  
[8.461272]  nvme_irq+0x10/0x20 [nvme]
[8.461951]  __handle_irq_event_percpu+0x45/0x1b0
[8.462803]  ? handle_fasteoi_irq+0x210/0x210
[8.463585]  handle_irq_event+0x58/0xb0
[8.464312]  handle_edge_irq+0xae/0x270
[8.465027]  asm_call_on_stack+0x12/0x20
[8.465686]  
[8.466053]  common_interrupt+0x120/0x1f0
[8.466717]  asm_common_interrupt+0x1e/0x40
[8.467429] RIP: 0010:default_idle+0x21/0x170
[8.468140] Code: eb a6 e8 82 27 ff ff cc cc 0f 1f 44 00 00 41 54 55 53 e8 
e2 2e ff ff 0f 1f 4
4 00 00 e9 07 00 00 00 0f 00 2d f3 d4 44 00 fb f4  ca 2e ff ff 89 c5 0f 1f 
44 00 00 5b 5d 41
5c c3 89 c5 65 8b 05
[8.471286] RSP: 0018:c909fec8 EFLAGS: 0282
[8.472202] RAX: 0003 RBX: 888276ff RCX: 0001
[8.473405] RDX: 0001 RSI: 8212355f RDI: 8212d699
[8.474571] RBP: 0003 R08: 888277d9e4a0 R09: 0020
[8.475717] R10:  R11:  R12: 
[8.476921] R13:  R14:  R15: 
[8.478110]  ? default_idle+0xe/0x170
[8.478728]  do_idle+0x1e1/0x240
[8.479283]  ? _raw_spin_lock_irqsave+0x19/0x40
[8.480040]  cpu_startup_entry+0x19/0x20
[8.480705]  start_secondary+0x153/0x190
[8.481400]  secondary_startup_64+0xb6/0xc0
[8.482114] Modules linked in: libata nvme nvme_core scsi_mod t10_pi 
crc_t10dif crct10dif_gene
ric crct10dif_common virtio_net net_failover failover virtio_rng rng_core
[8.484675] CR2: 0120
[8.485264] ---[ end trace ff1849437c76af12 ]---
[8.486065] RIP: 0010:nvme_process_cq+0xc4/0x200 [nvme]
[8.486953] Code: cf 00 00 00 48 8b 57 70 48 8b 7c c2 f8 e8 14 e9 32 c1 49 
89 c7 0f 1f 44 00 0
0 41 0f b7 44 24 0e 49 8b 14 24 4c 89 ff 66 d1 e8 <49> 89 97 20 01 00 00 66 41 
89 87 2a 01 00 00
e8 28 04 33 c1 0f b7
[8.490234] RSP: 0018:c915cf20 EFLAGS: 00010803
[8.491144] RAX: 400b RBX: 88826faad0c0 RCX: 
[8.492445] RDX:  RSI:  RDI: 
[8.493681] RBP: 8882725d38e4 R08: 0001f71e225c R09: 
[8.494907] R10:  R11:  R12: 888270bb
[8.496130] R13:  R14: c915cfac R15: 
[8.497363] FS:  () 

[PATCH v4 2/2] nvme: allow cmb and pmr to be enabled on same device

2020-07-01 Thread Andrzej Jakowski
So far it was not possible to have CMB and PMR emulated on the same
device, because BAR2 was used exclusively by either PMR or CMB. This
patch places the CMB at an offset within BAR4 so that it does not
conflict with the MSI-X vectors.

Signed-off-by: Andrzej Jakowski 
---
 hw/block/nvme.c  | 101 +--
 hw/block/nvme.h  |   1 +
 include/block/nvme.h |   4 +-
 3 files changed, 72 insertions(+), 34 deletions(-)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index 9f11f3e9da..f5156bcfe5 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -22,12 +22,12 @@
  *  [pmrdev=,] \
  *  max_ioqpairs=
  *
- * Note cmb_size_mb denotes size of CMB in MB. CMB is assumed to be at
- * offset 0 in BAR2 and supports only WDS, RDS and SQS for now.
+ * Note cmb_size_mb denotes size of CMB in MB. CMB when configured is assumed
+ * to be resident in BAR4 at certain offset - this is because BAR4 is also
+ * used for storing MSI-X table that is available at offset 0 in BAR4.
  *
- * cmb_size_mb= and pmrdev= options are mutually exclusive due to limitation
- * in available BAR's. cmb_size_mb= will take precedence over pmrdev= when
- * both provided.
+ * pmrdev is assumed to be resident in BAR2/BAR3. When configured it consumes
+ * whole BAR2/BAR3 exclusively.
  * Enabling pmr emulation can be achieved by pointing to memory-backend-file.
  * For example:
  * -object memory-backend-file,id=,share=on,mem-path=, \
@@ -57,8 +57,8 @@
 #define NVME_MAX_IOQPAIRS 0x
 #define NVME_REG_SIZE 0x1000
 #define NVME_DB_SIZE  4
-#define NVME_CMB_BIR 2
 #define NVME_PMR_BIR 2
+#define NVME_MSIX_BIR 4
 
 #define NVME_GUEST_ERR(trace, fmt, ...) \
 do { \
@@ -1395,7 +1395,7 @@ static void nvme_check_constraints(NvmeCtrl *n, Error **errp)
 return;
 }
 
-if (!n->params.cmb_size_mb && n->pmrdev) {
+if (n->pmrdev) {
 if (host_memory_backend_is_mapped(n->pmrdev)) {
char *path = object_get_canonical_path_component(OBJECT(n->pmrdev));
 error_setg(errp, "can't use already busy memdev: %s", path);
@@ -1453,33 +1453,70 @@ static void nvme_init_namespace(NvmeCtrl *n, NvmeNamespace *ns, Error **errp)
 id_ns->nuse = id_ns->ncap;
 }
 
-static void nvme_init_cmb(NvmeCtrl *n, PCIDevice *pci_dev)
+static void nvme_bar4_init(PCIDevice *pci_dev, Error **errp)
 {
-NVME_CMBLOC_SET_BIR(n->bar.cmbloc, NVME_CMB_BIR);
-NVME_CMBLOC_SET_OFST(n->bar.cmbloc, 0);
-
-NVME_CMBSZ_SET_SQS(n->bar.cmbsz, 1);
-NVME_CMBSZ_SET_CQS(n->bar.cmbsz, 0);
-NVME_CMBSZ_SET_LISTS(n->bar.cmbsz, 0);
-NVME_CMBSZ_SET_RDS(n->bar.cmbsz, 1);
-NVME_CMBSZ_SET_WDS(n->bar.cmbsz, 1);
-NVME_CMBSZ_SET_SZU(n->bar.cmbsz, 2); /* MBs */
-NVME_CMBSZ_SET_SZ(n->bar.cmbsz, n->params.cmb_size_mb);
-
-n->cmbuf = g_malloc0(NVME_CMBSZ_GETSIZE(n->bar.cmbsz));
-memory_region_init_io(&n->ctrl_mem, OBJECT(n), &nvme_cmb_ops, n,
-  "nvme-cmb", NVME_CMBSZ_GETSIZE(n->bar.cmbsz));
-pci_register_bar(pci_dev, NVME_CMBLOC_BIR(n->bar.cmbloc),
+NvmeCtrl *n = NVME(pci_dev);
+int status;
+uint64_t bar_size, cmb_offset = 0;
+uint32_t msix_vectors;
+uint32_t nvme_pba_offset;
+uint32_t cmb_size_units;
+
+msix_vectors = n->params.max_ioqpairs + 1;
+nvme_pba_offset = PCI_MSIX_ENTRY_SIZE * msix_vectors;
+bar_size = nvme_pba_offset + QEMU_ALIGN_UP(msix_vectors, 64) / 8;
+
+if (n->params.cmb_size_mb) {
+NVME_CMBSZ_SET_SQS(n->bar.cmbsz, 1);
+NVME_CMBSZ_SET_CQS(n->bar.cmbsz, 0);
+NVME_CMBSZ_SET_LISTS(n->bar.cmbsz, 0);
+NVME_CMBSZ_SET_RDS(n->bar.cmbsz, 1);
+NVME_CMBSZ_SET_WDS(n->bar.cmbsz, 1);
+NVME_CMBSZ_SET_SZU(n->bar.cmbsz, 2); /* MBs */
+NVME_CMBSZ_SET_SZ(n->bar.cmbsz, n->params.cmb_size_mb);
+
+cmb_size_units = NVME_CMBSZ_GETSIZEUNITS(n->bar.cmbsz);
+cmb_offset = QEMU_ALIGN_UP(bar_size, cmb_size_units);
+
+NVME_CMBLOC_SET_BIR(n->bar.cmbloc, NVME_MSIX_BIR);
+NVME_CMBLOC_SET_OFST(n->bar.cmbloc, cmb_offset / cmb_size_units);
+
+n->cmbuf = g_malloc0(NVME_CMBSZ_GETSIZE(n->bar.cmbsz));
+
+bar_size += cmb_offset;
+bar_size += NVME_CMBSZ_GETSIZE(n->bar.cmbsz);
+}
+
+bar_size = pow2ceil(bar_size);
+
+/* Create memory region for BAR4, then overlap cmb, msix and pba
+ * tables on top of it.*/
+memory_region_init(&n->bar4, OBJECT(n), "nvme-bar4", bar_size);
+
+if (n->params.cmb_size_mb) {
+memory_region_init_io(&n->ctrl_mem, OBJECT(n), &nvme_cmb_ops, n,
+  "nvme-cmb", NVME_CMBSZ_GETSIZE(n->bar.cmbsz));
+
+memory_region_add_subregion(&n->bar4, cmb_offset, &n->ctrl_mem);
+}
+
+status = msix_init(pci_dev, n->params.msix_qsize,
+   &n->bar4, NVME_MSIX_BIR, 0,
+   &n->bar4, NVME_MSIX_BIR, nvme_pba_offset,
+   0, errp);
+
+if (status) {
+return;
+}
+
+pci_register_bar(pci_dev, NVME_MSIX_BIR,