Re: [PATCH] KVM: arm/arm64: Set ICH_HCR_EN in guest anyway when using gicv4

2019-03-13 Thread Marc Zyngier
On 13/03/2019 15:59, Shameerali Kolothum Thodi wrote:

Hi Shameer,

>> Can you please give the following patch a go? I can't test it, but hopefully
>> you can.
> 
> Thanks for your quick response. I just did a quick test on one of our 
> platforms
> with VHE+GICv4 and it seems to fix the performance issue we were seeing
> when GICv4 is enabled.
> 
> Test setup:
> 
> Host connected to a PC over a 10G port.
> Launch Guest with an assigned vf dev.
> Run iperf from Guest,
> 
> 5.0 kernel:
> [ ID]  Interval   Transfer Bandwidth
> [  3]  0.0-10.0 sec  1.30 GBytes  1.12 Gbits/sec
> 
> +Patch:
> 
> [ ID] Interval   Transfer Bandwidth
> [  3]  0.0-10.0 sec  10.9 GBytes  9.39 Gbits/sec

Ah, that looks much better! I'll try to write it properly, as I think
what we have today is slightly bizarre (we write ICH_HCR_EL2 from too
many paths, and I need to remember how the whole thing works).

Thanks,

M.
-- 
Jazz is not dead. It just smells funny...


RE: [PATCH] KVM: arm/arm64: Set ICH_HCR_EN in guest anyway when using gicv4

2019-03-13 Thread Shameerali Kolothum Thodi
Hi Marc,

> -Original Message-
> From: kvmarm-boun...@lists.cs.columbia.edu
> [mailto:kvmarm-boun...@lists.cs.columbia.edu] On Behalf Of Marc Zyngier
> Sent: 13 March 2019 11:59
> To: Tangnianyao (ICT) ; Christoffer Dall
> ; james.mo...@arm.com; julien.thie...@arm.com;
> suzuki.poul...@arm.com; catalin.mari...@arm.com; will.dea...@arm.com;
> alex.ben...@linaro.org; mark.rutl...@arm.com; andre.przyw...@arm.com;
> Zhangshaokun ; keesc...@chromium.org;
> linux-arm-ker...@lists.infradead.org; kvmarm@lists.cs.columbia.edu;
> linux-ker...@vger.kernel.org
> Subject: Re: [PATCH] KVM: arm/arm64: Set ICH_HCR_EN in guest anyway when
> using gicv4
> 
> On 12/03/2019 15:48, Marc Zyngier wrote:
> > Nianyao,
> >
> > Please do not send patches as HTML. Or any email as HTML. Please make
> > sure that you only send plain text emails on any mailing list (see point
> > #6 in Documentation/process/submitting-patches.rst).
> >
> > On 12/03/2019 12:22, Tangnianyao (ICT) wrote:
> >> In gicv4, direct vlpi may be forwarded to PE without using LR or ap_list. If
> >> ICH_HCR_EL2.En is 0 in guest, direct vlpi cannot be accepted by PE.
> >> It will take a long time for direct vlpi to be forwarded in some cases.
> >> Direct vlpi has to wait for ICH_HCR_EL2.En=1 where some other interrupts
> >> using LR need to be forwarded, because in kvm_vgic_flush_hwstate,
> >> if ap_list is empty, it will return quickly not setting ICH_HCR_EL2.En.
> >> To fix this, set ICH_HCR_EL2.En to 1 before returning to guest when
> >> using GICv4.
> >>
> >> Signed-off-by: Nianyao Tang 
> >> ---
> >> arch/arm64/include/asm/kvm_asm.h |  1 +
> >> virt/kvm/arm/hyp/vgic-v3-sr.c    | 10 ++++++++++
> >> virt/kvm/arm/vgic/vgic-v4.c      |  8 ++++++++
> >> 3 files changed, 19 insertions(+)
> >>
> >> diff --git a/arch/arm64/include/asm/kvm_asm.h b/arch/arm64/include/asm/kvm_asm.h
> >> index f5b79e9..0581c4d 100644
> >> --- a/arch/arm64/include/asm/kvm_asm.h
> >> +++ b/arch/arm64/include/asm/kvm_asm.h
> >> @@ -79,6 +79,7 @@
> >> extern void __vgic_v3_init_lrs(void);
> >>  extern u32 __kvm_get_mdcr_el2(void);
> >> +extern void __vgic_v3_write_hcr(u32 vmcr);
> >>  /* Home-grown __this_cpu_{ptr,read} variants that always work at HYP */
> >> #define __hyp_this_cpu_ptr(sym)                                        \
> >> diff --git a/virt/kvm/arm/hyp/vgic-v3-sr.c b/virt/kvm/arm/hyp/vgic-v3-sr.c
> >> index 264d92d..12027af 100644
> >> --- a/virt/kvm/arm/hyp/vgic-v3-sr.c
> >> +++ b/virt/kvm/arm/hyp/vgic-v3-sr.c
> >> @@ -434,6 +434,16 @@ void __hyp_text __vgic_v3_write_vmcr(u32 vmcr)
> >>    write_gicreg(vmcr, ICH_VMCR_EL2);
> >> }
> >> +u64 __hyp_text __vgic_v3_read_hcr(void)
> >> +{
> >> +   return read_gicreg(ICH_HCR_EL2);
> >> +}
> >> +
> >> +void __hyp_text __vgic_v3_write_hcr(u32 vmcr)
> >> +{
> >> +   write_gicreg(vmcr, ICH_HCR_EL2);
> >> +}
> >
> > This is HYP code...
> >
> >>
> >> +
> >> #ifdef CONFIG_ARM64
> >>  static int __hyp_text __vgic_v3_bpr_min(void)
> >> diff --git a/virt/kvm/arm/vgic/vgic-v4.c b/virt/kvm/arm/vgic/vgic-v4.c
> >> index 1ed5f22..515171a 100644
> >> --- a/virt/kvm/arm/vgic/vgic-v4.c
> >> +++ b/virt/kvm/arm/vgic/vgic-v4.c
> >> @@ -208,6 +208,8 @@ int vgic_v4_sync_hwstate(struct kvm_vcpu *vcpu)
> >>    if (!vgic_supports_direct_msis(vcpu->kvm))
> >>     return 0;
> >> +   __vgic_v3_write_hcr(vcpu->arch.vgic_cpu.vgic_v3.vgic_hcr & ~ICH_HCR_EN);
> >
> > And you've now crashed your non-VHE system by branching directly to code
> > that cannot be executed at EL1.
> >
> >>
> >> +
> >>    return its_schedule_vpe(&vcpu->arch.vgic_cpu.vgic_v3.its_vpe, false);
> >> }
> >> @@ -220,6 +222,12 @@ int vgic_v4_flush_hwstate(struct kvm_vcpu *vcpu)
> >>     return 0;
> >>     /*
> >> +   * Enable ICH_HCR_EL2.En so that the guest can accept and handle direct
> >> +   * vlpi.
> >> +   */
> >> +   __vgic_v3_write_hcr(vcpu->arch.vgic_cpu.vgic_v3.vgic_hcr);
> >> +
> >> +   /*
> >>     * Before making the VPE resident, make sure the redistributor
> >>     * corresponding to our current CPU expects us here. See the
> >>     * doc in drivers/irqchip/irq-gic-v4.c to understand how this
> >> --
> >> 1.9.1
> >
> > Overall, this looks like the wrong approach. It is not the GICv4 code's
> > job to do this, but the low-level code (either the load/put code for VHE
> > or the save/restore code for non-VHE).
> >
> > It would certainly help if you could describe which context you're in
> > (VHE, non-VHE). I suppose the former...
> 
> Can you please give the following patch a go? I can't test it, but hopefully
> you can.

Thanks for your quick response. I just did a 

Re: [RFC] arm/cpu: fix soft lockup panic after resuming from stop

2019-03-13 Thread Heyi Guo

Dear all,

Really appreciate your comments and information. For "Live Physical Time",
is there any document that tells the whole story? And may I know the
timeline for making it happen?

Thanks,

Heyi


On 2019/3/12 22:12, Marc Zyngier wrote:

Hi Peter,

On 12/03/2019 10:08, Peter Maydell wrote:

On Tue, 12 Mar 2019 at 06:10, Heyi Guo  wrote:

When we stop a VM for more than 30 seconds and then resume it, by qemu
monitor command "stop" and "cont", Linux on VM will complain of "soft
lockup - CPU#x stuck for xxs!" as below:

[ 2783.809517] watchdog: BUG: soft lockup - CPU#3 stuck for 2395s!
[ 2783.809559] watchdog: BUG: soft lockup - CPU#2 stuck for 2395s!
[ 2783.809561] watchdog: BUG: soft lockup - CPU#1 stuck for 2395s!
[ 2783.809563] Modules linked in...

This is because Guest Linux uses generic timer virtual counter as
a software watchdog, and CNTVCT_EL0 does not stop when VM is stopped
by qemu.

This patch is to fix this issue by saving the value of CNTVCT_EL0 when
stopping and restoring it when resuming.

Hi -- I know we have issues with the passage of time in Arm VMs
running under KVM when the VM is suspended, but the topic is
a tricky one, and it's not clear to me that this is the correct
way to fix it. I would prefer to see us start with a discussion
on the kvm-arm mailing list about the best approach to the problem.

I've cc'd that list and a couple of the Arm KVM maintainers
for their opinion.

QEMU patch left below for context -- the brief summary is that
it uses KVM_GET_ONE_REG/KVM_SET_ONE_REG on the timer CNT register
to save it on VM pause and write that value back on VM resume.
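
(For reference, that mechanism boils down to two ONE_REG accesses per vcpu.
A minimal userspace sketch, not the actual QEMU patch -- vcpu_fd and the
pause/resume hook points are assumptions:)

#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

static uint64_t cntvct_snapshot;

/* On the "stop" monitor command: snapshot the virtual counter. */
static void vm_pause_save_cnt(int vcpu_fd)
{
        struct kvm_one_reg reg = {
                .id   = KVM_REG_ARM_TIMER_CNT,
                .addr = (uintptr_t)&cntvct_snapshot,
        };

        ioctl(vcpu_fd, KVM_GET_ONE_REG, &reg);
}

/* On the "cont" monitor command: write the saved value back. */
static void vm_resume_restore_cnt(int vcpu_fd)
{
        struct kvm_one_reg reg = {
                .id   = KVM_REG_ARM_TIMER_CNT,
                .addr = (uintptr_t)&cntvct_snapshot,
        };

        ioctl(vcpu_fd, KVM_SET_ONE_REG, &reg);
}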

That's probably good enough for this particular use case, but I think
there is more. I can get into similar trouble if I suspend my laptop, or
suspend QEMU. It also has the slightly bizarre effect of skewing time,
and this will affect timekeeping in the guest in ways that are much more
subtle than just shouty CPUs.

Christoffer and Steve had some stuff regarding Live Physical Time, which
should cover that, and other cases such as host system suspend, and QEMU
being suspended.

Thanks,

M.





Re: [PATCH] KVM: arm/arm64: Set ICH_HCR_EN in guest anyway when using gicv4

2019-03-13 Thread Marc Zyngier
On 12/03/2019 15:48, Marc Zyngier wrote:
> Nianyao,
> 
> Please do not send patches as HTML. Or any email as HTML. Please make
> sure that you only send plain text emails on any mailing list (see point
> #6 in Documentation/process/submitting-patches.rst).
> 
> On 12/03/2019 12:22, Tangnianyao (ICT) wrote:
>> In gicv4, direct vlpi may be forwarded to PE without using LR or ap_list. If
>> ICH_HCR_EL2.En is 0 in guest, direct vlpi cannot be accepted by PE.
>> It will take a long time for direct vlpi to be forwarded in some cases.
>> Direct vlpi has to wait for ICH_HCR_EL2.En=1 where some other interrupts
>> using LR need to be forwarded, because in kvm_vgic_flush_hwstate,
>> if ap_list is empty, it will return quickly not setting ICH_HCR_EL2.En.
>> To fix this, set ICH_HCR_EL2.En to 1 before returning to guest when
>> using GICv4.
>>
>> Signed-off-by: Nianyao Tang 
>> ---
>> arch/arm64/include/asm/kvm_asm.h |  1 +
>> virt/kvm/arm/hyp/vgic-v3-sr.c    | 10 ++++++++++
>> virt/kvm/arm/vgic/vgic-v4.c      |  8 ++++++++
>> 3 files changed, 19 insertions(+)
>>
>> diff --git a/arch/arm64/include/asm/kvm_asm.h b/arch/arm64/include/asm/kvm_asm.h
>> index f5b79e9..0581c4d 100644
>> --- a/arch/arm64/include/asm/kvm_asm.h
>> +++ b/arch/arm64/include/asm/kvm_asm.h
>> @@ -79,6 +79,7 @@
>> extern void __vgic_v3_init_lrs(void);
>>  extern u32 __kvm_get_mdcr_el2(void);
>> +extern void __vgic_v3_write_hcr(u32 vmcr);
>>  /* Home-grown __this_cpu_{ptr,read} variants that always work at HYP */
>> #define __hyp_this_cpu_ptr(sym)                                        \
>> diff --git a/virt/kvm/arm/hyp/vgic-v3-sr.c b/virt/kvm/arm/hyp/vgic-v3-sr.c
>> index 264d92d..12027af 100644
>> --- a/virt/kvm/arm/hyp/vgic-v3-sr.c
>> +++ b/virt/kvm/arm/hyp/vgic-v3-sr.c
>> @@ -434,6 +434,16 @@ void __hyp_text __vgic_v3_write_vmcr(u32 vmcr)
>>    write_gicreg(vmcr, ICH_VMCR_EL2);
>> }
>> +u64 __hyp_text __vgic_v3_read_hcr(void)
>> +{
>> +   return read_gicreg(ICH_HCR_EL2);
>> +}
>> +
>> +void __hyp_text __vgic_v3_write_hcr(u32 vmcr)
>> +{
>> +   write_gicreg(vmcr, ICH_HCR_EL2);
>> +}
> 
> This is HYP code...
> 
>>
>> +
>> #ifdef CONFIG_ARM64
>>  static int __hyp_text __vgic_v3_bpr_min(void)
>> diff --git a/virt/kvm/arm/vgic/vgic-v4.c b/virt/kvm/arm/vgic/vgic-v4.c
>> index 1ed5f22..515171a 100644
>> --- a/virt/kvm/arm/vgic/vgic-v4.c
>> +++ b/virt/kvm/arm/vgic/vgic-v4.c
>> @@ -208,6 +208,8 @@ int vgic_v4_sync_hwstate(struct kvm_vcpu *vcpu)
>>    if (!vgic_supports_direct_msis(vcpu->kvm))
>>     return 0;
>> +   __vgic_v3_write_hcr(vcpu->arch.vgic_cpu.vgic_v3.vgic_hcr & ~ICH_HCR_EN);
> 
> And you've now crashed your non-VHE system by branching directly to code
> that cannot be executed at EL1.
> 
>>
>> +
>>    return its_schedule_vpe(&vcpu->arch.vgic_cpu.vgic_v3.its_vpe, false);
>> }
>> @@ -220,6 +222,12 @@ int vgic_v4_flush_hwstate(struct kvm_vcpu *vcpu)
>>     return 0;
>>     /*
>> +   * Enable ICH_HCR_EL2.En so that the guest can accept and handle direct
>> +   * vlpi.
>> +   */
>> +   __vgic_v3_write_hcr(vcpu->arch.vgic_cpu.vgic_v3.vgic_hcr);
>> +
>> +   /*
>>     * Before making the VPE resident, make sure the redistributor
>>     * corresponding to our current CPU expects us here. See the
>>     * doc in drivers/irqchip/irq-gic-v4.c to understand how this
>> -- 
>> 1.9.1
> 
> Overall, this looks like the wrong approach. It is not the GICv4 code's
> job to do this, but the low-level code (either the load/put code for VHE
> or the save/restore code for non-VHE).
> 
> It would certainly help if you could describe which context you're in
> (VHE, non-VHE). I suppose the former...

Can you please give the following patch a go? I can't test it, but hopefully
you can.

Thanks,

M.

diff --git a/virt/kvm/arm/hyp/vgic-v3-sr.c b/virt/kvm/arm/hyp/vgic-v3-sr.c
index 9652c453480f..3c3f7cda95c7 100644
--- a/virt/kvm/arm/hyp/vgic-v3-sr.c
+++ b/virt/kvm/arm/hyp/vgic-v3-sr.c
@@ -222,7 +222,7 @@ void __hyp_text __vgic_v3_save_state(struct kvm_vcpu *vcpu)
}
}
 
-   if (used_lrs) {
+   if (used_lrs || cpu_if->its_vpe.its_vm) {
int i;
u32 elrsr;
 
@@ -247,7 +247,7 @@ void __hyp_text __vgic_v3_restore_state(struct kvm_vcpu *vcpu)
u64 used_lrs = vcpu->arch.vgic_cpu.used_lrs;
int i;
 
-   if (used_lrs) {
+   if (used_lrs || cpu_if->its_vpe.its_vm) {
write_gicreg(cpu_if->vgic_hcr, ICH_HCR_EL2);
 
for (i = 0; i < used_lrs; i++)
diff --git a/virt/kvm/arm/vgic/vgic.c b/virt/kvm/arm/vgic/vgic.c
index abd9c7352677..3af69f2a3866 100644
--- a/virt/kvm/arm/vgic/vgic.c
+++ b/virt/kvm/arm/vgic/vgic.c
@@ -867,15 +867,21 @@ void 

Re: Question: KVM: Failed to bind vfio with PCI-e / SMMU on Juno-r2

2019-03-13 Thread Leo Yan
On Wed, Mar 13, 2019 at 11:24:25AM +0100, Auger Eric wrote:

[...]

> >> I am stuck at the point where the network card cannot receive interrupts
> >> in the guest OS.  So I took time to look into the code and added some
> >> debug prints to help me understand the detailed flow; below are two main
> >> questions I am confused by and need some guidance on:
> >>
> >> - The first question is about the MSI usage in the network card driver;
> >>   when reviewing the sky2 network card driver [1], it has a function
> >>   sky2_test_msi() which is used to decide whether it can use MSI or not.
> >>
> >>   The interesting thing is that this function will first request an irq
> >>   for the interrupt and, based on the interrupt handler reading back a
> >>   register, it can then decide whether MSI is available or not.
> >>
> >>   This works well for the host OS, but if we want to pass this device
> >>   through to a guest OS, since KVM doesn't prepare the interrupt for the
> >>   sky2 driver (no injection or forwarding), at this point the interrupt
> >>   handler will not be invoked.  In the end the driver will not set the
> >>   flag 'hw->flags |= SKY2_HW_USE_MSI' and this results in not using MSI
> >>   in the guest OS, rolling back to INTx mode.
> >>
> >>   My first impression is that if we pass the device through to the guest
> >>   OS in KVM, the PCI-e device can directly use MSI;  I tweaked the code
> >>   a bit to check the status value after the timeout, so both host OS and
> >>   guest OS can set the flag for MSI.
> >>
> >>   I want to confirm: is this the recommended mode for a passed-through
> >>   PCI-e device, to use MSI both in the host OS and the guest OS?  Or
> >>   would it be fine for the host OS to use MSI and the guest OS to use
> >>   INTx mode?
> > 
> > If the NIC supports MSIs, they are normally used. This can easily be
> > checked on the host by issuing "cat /proc/interrupts | grep vfio". Can you
> > check whether the guest received any interrupt? I remember that Robin
> > said in the past that on Juno, the MSI doorbell was in the PCI host
> > bridge window and possibly transactions towards the doorbell could not
> > reach it since they were considered peer-to-peer.
> 
> I found Robin's explanation again. It was not related to the MSI IOVA being
> within the PCI host bridge window, but to the RAM GPA colliding with host
> PCI config space:
> 
> "MSI doorbells integral to PCIe root complexes (and thus untranslatable)
> typically have a programmable address, so could be anywhere. In the more
> general category of "special hardware addresses", QEMU's default ARM
> guest memory map puts RAM starting at 0x40000000; on the ARM Juno
> platform, that happens to be where PCI config space starts; as Juno's
> PCIe doesn't support ACS, peer-to-peer or anything clever, if you assign
> the PCI bus to a guest (all of it, given the lack of ACS), the root
> complex just sees the guest's attempts to DMA to "memory" as the device
> attempting to access config space and aborts them."

Thanks a lot for the info, Eric.

It seems to me this issue can be bypassed by using a different memory
address rather than 0x40000000 for the kernel IPA, thus avoiding the
collision with the host PCI config space.

Robin, just curious, have you tried changing the guest memory start
address to bypass this issue?  Or tried kvmtool on the Juno-r2 board
(e.g. kvmtool uses 0x40000000 for the AXI bus and 0x80000000 for RAM,
so we could do some quick shrinking and avoid touching the 0x40000000
region)?

Thanks,
Leo Yan


Re: Question: KVM: Failed to bind vfio with PCI-e / SMMU on Juno-r2

2019-03-13 Thread Leo Yan
Hi Eric,

On Wed, Mar 13, 2019 at 11:01:33AM +0100, Auger Eric wrote:

[...]

> >   I want to confirm: is this the recommended mode for a passed-through
> >   PCI-e device, to use MSI both in the host OS and the guest OS?  Or
> >   would it be fine for the host OS to use MSI and the guest OS to use
> >   INTx mode?
> 
> If the NIC supports MSIs, they are normally used. This can easily be
> checked on the host by issuing "cat /proc/interrupts | grep vfio". Can you
> check whether the guest received any interrupt? I remember that Robin
> said in the past that on Juno, the MSI doorbell was in the PCI host
> bridge window and possibly transactions towards the doorbell could not
> reach it since they were considered peer-to-peer. Using GICv2M should not
> bring any performance issue. I tested that in the past with a Seattle board.

I can see the below info on the host when launching KVM:

root@debian:~# cat /proc/interrupts | grep vfio
 46:  0  0  0  0  0  0  MSI 4194304 Edge  vfio-msi[0](0000:08:00.0)

And below are the interrupts in the guest:

# cat /proc/interrupts
   CPU0   CPU1   CPU2   CPU3   CPU4   CPU5
  3:506400281403298330 
GIC-0  27 Level arch_timer
  5:768  0  0  0  0  0 
GIC-0 101 Edge  virtio0
  6:246  0  0  0  0  0 
GIC-0 102 Edge  virtio1
  7:  2  0  0  0  0  0 
GIC-0 103 Edge  virtio2
  8:210  0  0  0  0  0 
GIC-0  97 Level ttyS0
 13:  0  0  0  0  0  0   
MSI   0 Edge  eth1

> > - The second question is about GICv2m.  If I understand correctly, when
> >   we pass a PCI-e device through to the guest OS, in the guest OS we
> >   should create the below data path for the PCI-e device:
> >                                                          +--------+
> >                                                       -> | Memory |
> >   +----------+    +------------------+    +-------+   /  +--------+
> >   | Net card | -> | PCI-e controller | -> | IOMMU | -
> >   +----------+    +------------------+    +-------+   \  +--------+
> >                                                       -> | MSI    |
> >                                                          | frame  |
> >                                                          +--------+
> > 
> >   Since the master is now the network card/PCI-e controller and not the
> >   CPU, there are no 2 stages for memory accesses (VA->IPA->PA).  In this
> >   case, do we configure the IOMMU (SMMU) for the guest OS for address
> >   translation before switching from host to guest?  Or does the SMMU
> >   also have two-stage memory mapping?
> 
> in your use case you don't have any virtual IOMMU. So the guest programs
> the assigned device with guest physical addresses, and the virtualizer uses
> the physical IOMMU to translate these GPAs into the host physical addresses
> backing the guest RAM and the MSI frame. A single stage of the physical
> IOMMU is used (stage1).
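
(For reference, the single-stage mapping described above is what the VMM
sets up through VFIO_IOMMU_MAP_DMA: IOVA == guest PA, backed by the host
VA of guest RAM. A minimal sketch, where container_fd and the addresses
are assumptions:)

#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

static int map_guest_ram(int container_fd, void *hva, uint64_t gpa,
                         uint64_t size)
{
        struct vfio_iommu_type1_dma_map map = {
                .argsz = sizeof(map),
                .flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE,
                .vaddr = (uintptr_t)hva, /* host VA backing guest RAM */
                .iova  = gpa,            /* the GPA the device will emit */
                .size  = size,
        };

        return ioctl(container_fd, VFIO_IOMMU_MAP_DMA, &map);
}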

Thanks a lot for the explanation.

> >   Another thing that confuses me is that I can see the MSI frame is
> >   mapped to the GIC's physical address in the host OS, so the PCI-e
> >   device can send messages correctly to the MSI frame.  But for the
> >   guest OS, the MSI frame is mapped to one IPA memory region, and this
> >   region is used to emulate the GICv2 MSI frame rather than the hardware
> >   MSI frame; so will any access from PCI-e to this region trap to the
> >   hypervisor on the CPU side so that the KVM hypervisor can help emulate
> >   (and inject) the interrupt for the guest OS?
> 
> when the device sends an MSI it uses a host allocated IOVA for the
> physical MSI doorbell. This gets translated by the physical IOMMU,
> reaches the physical doorbell. The physical GICv2m triggers the
> associated physical SPI -> kvm irqfd -> virtual IRQ
> With GICv2M we have direct GSI mapping on guest.
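
(For reference, the irqfd end of that chain is wired with the KVM_IRQFD
ioctl: an eventfd signalled by VFIO when the physical SPI fires is bound
to a guest GSI, which KVM then injects. A minimal sketch; vm_fd,
trigger_fd and gsi are assumptions:)

#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

static int wire_irqfd(int vm_fd, int trigger_fd, uint32_t gsi)
{
        struct kvm_irqfd irqfd = {
                .fd  = trigger_fd, /* eventfd signalled on the physical IRQ */
                .gsi = gsi,        /* guest interrupt to inject */
        };

        return ioctl(vm_fd, KVM_IRQFD, &irqfd);
}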

Just to confirm: in the flow you elaborated, the virtual IRQ will be
injected by qemu (or kvmtool) every time, but it does not need to
interfere with the IRQ's deactivation, right?

> >   Essentially, I want to check what the expected behaviour is for the
> >   GICv2 MSI frame working mode when we want to pass one PCI-e device
> >   through to the guest OS and the PCI-e device has one static MSI frame.
> 
> Your config was tested in the past with Seattle (not with a sky2 NIC
> though). Adding Robin for the potential peer-to-peer concern.

Very appreciate for your help.

Thanks,
Leo Yan


Re: Question: KVM: Failed to bind vfio with PCI-e / SMMU on Juno-r2

2019-03-13 Thread Auger Eric
Hi,

On 3/13/19 11:01 AM, Auger Eric wrote:
> Hi Leo,
> 
> + Robin
> 
> On 3/13/19 9:00 AM, Leo Yan wrote:
>> Hi Eric & all,
>>
>> On Mon, Mar 11, 2019 at 10:35:01PM +0800, Leo Yan wrote:
>>
>> [...]
>>
>>> So now I made some progress and can see the networking card is
>>> passed through to the guest OS, though the networking card reports
>>> errors now.  Below are the detailed steps and info:
>>>
>>> - Bind devices in the same IOMMU group to vfio driver:
>>>
>>>   echo 0000:03:00.0 > /sys/bus/pci/devices/0000\:03\:00.0/driver/unbind
>>>   echo 1095 3132 > /sys/bus/pci/drivers/vfio-pci/new_id
>>>
>>>   echo 0000:08:00.0 > /sys/bus/pci/devices/0000\:08\:00.0/driver/unbind
>>>   echo 11ab 4380 > /sys/bus/pci/drivers/vfio-pci/new_id
>>>
>>> - Enable 'allow_unsafe_interrupts=1' for module vfio_iommu_type1;
>>>
>>> - Use qemu to launch guest OS:
>>>
>>>   qemu-system-aarch64 \
>>> -cpu host -M virt,accel=kvm -m 4096 -nographic \
>>> -kernel /root/virt/Image -append root=/dev/vda2 \
>>> -net none -device vfio-pci,host=08:00.0 \
>>> -drive if=virtio,file=/root/virt/qemu/debian.img \
>>> -append 'loglevel=8 root=/dev/vda2 rw console=ttyAMA0 earlyprintk 
>>> ip=dhcp'
>>>
>>> - Host log:
>>>
>>> [  188.329861] vfio-pci 0000:08:00.0: enabling device (0000 -> 0003)
>>>
>>> - Below is the guest log; from the log, the driver has been registered
>>>   but it reports a PCI hardware failure and a timeout for the interrupt.
>>>
>>>   So is this caused by very 'slow' interrupt forwarding?  The Juno
>>>   board uses GICv2 (I think it has the GICv2m extension).
>>>
>>> [...]
>>>
>>> [1.024483] sky2 :00:01.0 eth0: enabling interface
>>> [1.026822] sky2 :00:01.0: error interrupt status=0x8000
>>> [1.029155] sky2 :00:01.0: PCI hardware error (0x1010)
>>> [4.000699] sky2 :00:01.0 eth0: Link is up at 1000 Mbps, full 
>>> duplex, flow control both
>>> [4.026116] Sending DHCP requests .
>>> [4.026201] sky2 :00:01.0: error interrupt status=0x8000
>>> [4.030043] sky2 :00:01.0: PCI hardware error (0x1010)
>>> [6.546111] ..
>>> [   14.118106] [ cut here ]
>>> [   14.120672] NETDEV WATCHDOG: eth0 (sky2): transmit queue 0 timed out
>>> [   14.123555] WARNING: CPU: 0 PID: 0 at net/sched/sch_generic.c:461 
>>> dev_watchdog+0x2b4/0x2c0
>>> [   14.127082] Modules linked in:
>>> [   14.128631] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 
>>> 5.0.0-rc8-00061-ga98f9a047756-dirty #
>>> [   14.132800] Hardware name: linux,dummy-virt (DT)
>>> [   14.135082] pstate: 6005 (nZCv daif -PAN -UAO)
>>> [   14.137459] pc : dev_watchdog+0x2b4/0x2c0
>>> [   14.139457] lr : dev_watchdog+0x2b4/0x2c0
>>> [   14.141351] sp : 10003d70
>>> [   14.142924] x29: 10003d70 x28: 112f60c0
>>> [   14.145433] x27: 0140 x26: 8000fa6eb3b8
>>> [   14.147936] x25:  x24: 8000fa7a7c80
>>> [   14.150428] x23: 8000fa6eb39c x22: 8000fa6eafb8
>>> [   14.152934] x21: 8000fa6eb000 x20: 112f7000
>>> [   14.155437] x19:  x18: 
>>> [   14.157929] x17:  x16: 
>>> [   14.160432] x15: 112fd6c8 x14: 90003a97
>>> [   14.162927] x13: 10003aa5 x12: 11315878
>>> [   14.165428] x11: 11315000 x10: 05f5e0ff
>>> [   14.167935] x9 : ffd0 x8 : 64656d6974203020
>>> [   14.170430] x7 : 6575657571207469 x6 : 00e3
>>> [   14.172935] x5 :  x4 : 
>>> [   14.175443] x3 :  x2 : 113158a8
>>> [   14.177938] x1 : f2db9128b1f08600 x0 : 
>>> [   14.180443] Call trace:
>>> [   14.181625]  dev_watchdog+0x2b4/0x2c0
>>> [   14.183377]  call_timer_fn+0x20/0x78
>>> [   14.185078]  expire_timers+0xa4/0xb0
>>> [   14.186777]  run_timer_softirq+0xa0/0x190
>>> [   14.188687]  __do_softirq+0x108/0x234
>>> [   14.190428]  irq_exit+0xcc/0xd8
>>> [   14.191941]  __handle_domain_irq+0x60/0xb8
>>> [   14.193877]  gic_handle_irq+0x58/0xb0
>>> [   14.195630]  el1_irq+0xb0/0x128
>>> [   14.197132]  arch_cpu_idle+0x10/0x18
>>> [   14.198835]  do_idle+0x1cc/0x288
>>> [   14.200389]  cpu_startup_entry+0x24/0x28
>>> [   14.202251]  rest_init+0xd4/0xe0
>>> [   14.203804]  arch_call_rest_init+0xc/0x14
>>> [   14.205702]  start_kernel+0x3d8/0x404
>>> [   14.207449] ---[ end trace 65449acd5c054609 ]---
>>> [   14.209630] sky2 :00:01.0 eth0: tx timeout
>>> [   14.211655] sky2 :00:01.0 eth0: transmit ring 0 .. 3 report=0 done=0
>>> [   17.906956] sky2 :00:01.0 eth0: Link is up at 1000 Mbps, full 
>>> duplex, flow control both
>>
>> I am stuck at the point where the network card cannot receive interrupts
>> in the guest OS.  So I took time to look into the code and added some
>> debug prints to help me understand the detailed flow; below are two main
>> questions I am confused by and need some guidance on:
>>
>> - The first question is about the 

Re: Re: [RFC] arm/cpu: fix soft lockup panic after resuming from stop

2019-03-13 Thread Steven Price
On 13/03/2019 01:57, Heyi Guo wrote:
> Dear all,
> 
> Really appreciate your comments and information. For "Live Physical
> Time", is there any document that tells the whole story? And may I know
> the timeline for making it happen?

The documentation is available here:
https://developer.arm.com/docs/den0057/a

I also posted an RFC for Linux support last year:
http://lists.infradead.org/pipermail/linux-arm-kernel/2018-December/620338.html

In terms of timeline I don't have anything definite. We need to get the
spec nailed down before much progress can be made on the code. And at
the moment Live Physical Time (LPT) isn't that useful on its own
because there are a number of other issues with migrating an arm64 guest
from one host to another which has different hardware (e.g. handling
different CPU errata).

Steve


Re: Question: KVM: Failed to bind vfio with PCI-e / SMMU on Juno-r2

2019-03-13 Thread Auger Eric
Hi Leo,

On 3/13/19 11:01 AM, Leo Yan wrote:
> On Wed, Mar 13, 2019 at 04:00:48PM +0800, Leo Yan wrote:
> 
> [...]
> 
>> - The second question is about GICv2m.  If I understand correctly, when
>>   we pass a PCI-e device through to the guest OS, in the guest OS we
>>   should create the below data path for the PCI-e device:
>>                                                          +--------+
>>                                                       -> | Memory |
>>   +----------+    +------------------+    +-------+   /  +--------+
>>   | Net card | -> | PCI-e controller | -> | IOMMU | -
>>   +----------+    +------------------+    +-------+   \  +--------+
>>                                                       -> | MSI    |
>>                                                          | frame  |
>>                                                          +--------+
>>
>>   Since the master is now the network card/PCI-e controller and not the
>>   CPU, there are no 2 stages for memory accesses (VA->IPA->PA).  In this
>>   case, do we configure the IOMMU (SMMU) for the guest OS for address
>>   translation before switching from host to guest?  Or does the SMMU
>>   also have two-stage memory mapping?
>>
>>   Another thing that confuses me is that I can see the MSI frame is
>>   mapped to the GIC's physical address in the host OS, so the PCI-e
>>   device can send messages correctly to the MSI frame.  But for the
>>   guest OS, the MSI frame is mapped to one IPA memory region, and this
>>   region is used to emulate the GICv2 MSI frame rather than the hardware
>>   MSI frame; so will any access from PCI-e to this region trap to the
>>   hypervisor on the CPU side so that the KVM hypervisor can help emulate
>>   (and inject) the interrupt for the guest OS?
>>
>>   Essentially, I want to check what the expected behaviour is for the
>>   GICv2 MSI frame working mode when we want to pass one PCI-e device
>>   through to the guest OS and the PCI-e device has one static MSI frame.
> 
> The blog [1] has the below explanation for my question about mapping the
> IOVA and the hardware MSI address.  But I searched for the flag
> VFIO_DMA_FLAG_MSI_RESERVED_IOVA, which isn't found in the mainline kernel;
> I might be missing something here and want to check whether the related
> patches have been merged into the mainline kernel?

Yes, all the mechanics for passthrough/MSI on ARM are upstream. The blog
page is outdated. The kernel allocates IOVAs for MSI doorbells
arbitrarily within this region:

#define MSI_IOVA_BASE   0x8000000
#define MSI_IOVA_LENGTH 0x100000

and userspace is no longer involved in passing a usable reserved IOVA
region.
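
(For reference, the reserved IOVA ranges, including this MSI window, can
be inspected from sysfs. A minimal sketch; the group number "9" is an
assumption, use the IOMMU group the NIC belongs to:)

#include <stdio.h>

int main(void)
{
        FILE *f = fopen("/sys/kernel/iommu_groups/9/reserved_regions", "r");
        char line[128];

        if (!f)
                return 1;
        /* Each line is "<start> <end> <type>", e.g.
         * "0x0000000008000000 0x00000000080fffff msi". */
        while (fgets(line, sizeof(line), f))
                fputs(line, stdout);
        fclose(f);
        return 0;
}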

Thanks

Eric
> 
> 'We reuse the VFIO DMA MAP ioctl to pass this reserved IOVA region. A
> new flag (VFIO_DMA_FLAG_MSI_RESERVED_IOVA ) is introduced to
> differentiate such reserved IOVA from RAM IOVA. Then the base/size of
> the window is passed to the IOMMU driver through a new function
> introduced in the IOMMU API.
> 
> The IOVA allocation within the supplied reserved IOVA window is
> performed on-demand, when the MSI controller composes/writes the MSI
> message in the PCIe device. Also the IOMMU mapping between the newly
> allocated IOVA and the backdoor address page is done at that time. The
> MSI controller uses a new function introduced in the IOMMU API to
> allocate the IOVA and create an IOMMU mapping.
>  
> So there are adaptations needed at VFIO, IOMMU and MSI controller
> level. The extension of the IOMMU API still is under discussion. Also
> changes at MSI controller level need to be consolidated.'
> 
> P.S. I also tried both tools, qemu and kvmtool; neither can deliver the
> interrupt for the network card in the guest OS.
> 
> Thanks,
> Leo Yan
> 
> [1] https://www.linaro.org/blog/kvm-pciemsi-passthrough-armarm64/
> 


Re: [RFC] arm/cpu: fix soft lockup panic after resuming from stop

2019-03-13 Thread Steven Price
On 12/03/2019 14:12, Marc Zyngier wrote:
> Hi Peter,
> 
> On 12/03/2019 10:08, Peter Maydell wrote:
>> On Tue, 12 Mar 2019 at 06:10, Heyi Guo  wrote:
>>>
>>> When we stop a VM for more than 30 seconds and then resume it, by qemu
>>> monitor command "stop" and "cont", Linux on VM will complain of "soft
>>> lockup - CPU#x stuck for xxs!" as below:
>>>
>>> [ 2783.809517] watchdog: BUG: soft lockup - CPU#3 stuck for 2395s!
>>> [ 2783.809559] watchdog: BUG: soft lockup - CPU#2 stuck for 2395s!
>>> [ 2783.809561] watchdog: BUG: soft lockup - CPU#1 stuck for 2395s!
>>> [ 2783.809563] Modules linked in...
>>>
>>> This is because Guest Linux uses generic timer virtual counter as
>>> a software watchdog, and CNTVCT_EL0 does not stop when VM is stopped
>>> by qemu.
>>>
>>> This patch is to fix this issue by saving the value of CNTVCT_EL0 when
>>> stopping and restoring it when resuming.

An alternative way of fixing this particular issue ("stop"/"cont"
commands in QEMU) would be to wire up KVM_KVMCLOCK_CTRL for arm to allow
QEMU to signal to the guest that it was forcibly stopped for a while
(and so the watchdog timeout can be ignored by the guest).
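
(A minimal sketch of the userspace side of that idea. KVM_KVMCLOCK_CTRL
exists today for x86, where it sets a "guest stopped" hint that the
guest's watchdog can honour; wiring it up for arm is the hypothetical
part:)

#include <sys/ioctl.h>
#include <linux/kvm.h>

static void notify_guest_stopped(int vcpu_fd)
{
        /* On "stop": tell the guest its clock was deliberately paused,
         * so the soft-lockup watchdog should ignore the gap. */
        ioctl(vcpu_fd, KVM_KVMCLOCK_CTRL, 0);
}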

>> Hi -- I know we have issues with the passage of time in Arm VMs
>> running under KVM when the VM is suspended, but the topic is
>> a tricky one, and it's not clear to me that this is the correct
>> way to fix it. I would prefer to see us start with a discussion
>> on the kvm-arm mailing list about the best approach to the problem.
>>
>> I've cc'd that list and a couple of the Arm KVM maintainers
>> for their opinion.
>>
>> QEMU patch left below for context -- the brief summary is that
>> it uses KVM_GET_ONE_REG/KVM_SET_ONE_REG on the timer CNT register
>> to save it on VM pause and write that value back on VM resume.
> 
> That's probably good enough for this particular use case, but I think
> there is more. I can get into similar trouble if I suspend my laptop, or
> suspend QEMU. It also has the slightly bizarre effect of skewing time,
> and this will affect timekeeping in the guest in ways that are much more
> subtle than just shouty CPUs.

Indeed this is the bigger issue - user space doesn't get an opportunity
to be involved when suspending/resuming, so saving/restoring (or using
KVM_KVMCLOCK_CTRL) in user space won't fix these cases.

> Christoffer and Steve had some stuff regarding Live Physical Time, which
> should cover that, and other cases such as host system suspend, and QEMU
> being suspended.

Live Physical Time (LPT) is only part of the solution - this handles the
mess that otherwise would occur when moving to a new host with a
different clock frequency.

Personally I think what we need is:

* Either a patch like the one from Heyi Guo (save/restore CNTVCT_EL0) or
alternatively hooking up KVM_KVMCLOCK_CTRL to prevent the watchdog
firing when user space explicitly stops scheduling the guest for a while.

* KVM itself saving/restoring CNTVCT_EL0 during suspend/resume so the
guest doesn't see time pass during a suspend.

* Something equivalent to MSR_KVM_WALL_CLOCK_NEW for arm which allows
the guest to query the wall clock time from the host and provides an
offset between CNTVCT_EL0 and wall clock time which KVM can update
during suspend/resume. This means that during a suspend/resume the guest
can observe that wall clock time has passed, without having to be
bothered about CNTVCT_EL0 jumping forwards.

Steve


Re: Question: KVM: Failed to bind vfio with PCI-e / SMMU on Juno-r2

2019-03-13 Thread Auger Eric
Hi Leo,

+ Robin

On 3/13/19 9:00 AM, Leo Yan wrote:
> Hi Eric & all,
> 
> On Mon, Mar 11, 2019 at 10:35:01PM +0800, Leo Yan wrote:
> 
> [...]
> 
>> So now I made some progress and can see the networking card is
>> passed through to the guest OS, though the networking card reports
>> errors now.  Below are the detailed steps and info:
>>
>> - Bind devices in the same IOMMU group to vfio driver:
>>
>>   echo 0000:03:00.0 > /sys/bus/pci/devices/0000\:03\:00.0/driver/unbind
>>   echo 1095 3132 > /sys/bus/pci/drivers/vfio-pci/new_id
>>
>>   echo 0000:08:00.0 > /sys/bus/pci/devices/0000\:08\:00.0/driver/unbind
>>   echo 11ab 4380 > /sys/bus/pci/drivers/vfio-pci/new_id
>>
>> - Enable 'allow_unsafe_interrupts=1' for module vfio_iommu_type1;
>>
>> - Use qemu to launch guest OS:
>>
>>   qemu-system-aarch64 \
>> -cpu host -M virt,accel=kvm -m 4096 -nographic \
>> -kernel /root/virt/Image -append root=/dev/vda2 \
>> -net none -device vfio-pci,host=08:00.0 \
>> -drive if=virtio,file=/root/virt/qemu/debian.img \
>> -append 'loglevel=8 root=/dev/vda2 rw console=ttyAMA0 earlyprintk 
>> ip=dhcp'
>>
>> - Host log:
>>
>> [  188.329861] vfio-pci 0000:08:00.0: enabling device (0000 -> 0003)
>>
>> - Below is the guest log; from the log, the driver has been registered
>>   but it reports a PCI hardware failure and a timeout for the interrupt.
>>
>>   So is this caused by very 'slow' interrupt forwarding?  The Juno
>>   board uses GICv2 (I think it has the GICv2m extension).
>>
>> [...]
>>
>> [1.024483] sky2 :00:01.0 eth0: enabling interface
>> [1.026822] sky2 :00:01.0: error interrupt status=0x8000
>> [1.029155] sky2 :00:01.0: PCI hardware error (0x1010)
>> [4.000699] sky2 :00:01.0 eth0: Link is up at 1000 Mbps, full duplex, 
>> flow control both
>> [4.026116] Sending DHCP requests .
>> [4.026201] sky2 :00:01.0: error interrupt status=0x8000
>> [4.030043] sky2 :00:01.0: PCI hardware error (0x1010)
>> [6.546111] ..
>> [   14.118106] [ cut here ]
>> [   14.120672] NETDEV WATCHDOG: eth0 (sky2): transmit queue 0 timed out
>> [   14.123555] WARNING: CPU: 0 PID: 0 at net/sched/sch_generic.c:461 
>> dev_watchdog+0x2b4/0x2c0
>> [   14.127082] Modules linked in:
>> [   14.128631] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 
>> 5.0.0-rc8-00061-ga98f9a047756-dirty #
>> [   14.132800] Hardware name: linux,dummy-virt (DT)
>> [   14.135082] pstate: 6005 (nZCv daif -PAN -UAO)
>> [   14.137459] pc : dev_watchdog+0x2b4/0x2c0
>> [   14.139457] lr : dev_watchdog+0x2b4/0x2c0
>> [   14.141351] sp : 10003d70
>> [   14.142924] x29: 10003d70 x28: 112f60c0
>> [   14.145433] x27: 0140 x26: 8000fa6eb3b8
>> [   14.147936] x25:  x24: 8000fa7a7c80
>> [   14.150428] x23: 8000fa6eb39c x22: 8000fa6eafb8
>> [   14.152934] x21: 8000fa6eb000 x20: 112f7000
>> [   14.155437] x19:  x18: 
>> [   14.157929] x17:  x16: 
>> [   14.160432] x15: 112fd6c8 x14: 90003a97
>> [   14.162927] x13: 10003aa5 x12: 11315878
>> [   14.165428] x11: 11315000 x10: 05f5e0ff
>> [   14.167935] x9 : ffd0 x8 : 64656d6974203020
>> [   14.170430] x7 : 6575657571207469 x6 : 00e3
>> [   14.172935] x5 :  x4 : 
>> [   14.175443] x3 :  x2 : 113158a8
>> [   14.177938] x1 : f2db9128b1f08600 x0 : 
>> [   14.180443] Call trace:
>> [   14.181625]  dev_watchdog+0x2b4/0x2c0
>> [   14.183377]  call_timer_fn+0x20/0x78
>> [   14.185078]  expire_timers+0xa4/0xb0
>> [   14.186777]  run_timer_softirq+0xa0/0x190
>> [   14.188687]  __do_softirq+0x108/0x234
>> [   14.190428]  irq_exit+0xcc/0xd8
>> [   14.191941]  __handle_domain_irq+0x60/0xb8
>> [   14.193877]  gic_handle_irq+0x58/0xb0
>> [   14.195630]  el1_irq+0xb0/0x128
>> [   14.197132]  arch_cpu_idle+0x10/0x18
>> [   14.198835]  do_idle+0x1cc/0x288
>> [   14.200389]  cpu_startup_entry+0x24/0x28
>> [   14.202251]  rest_init+0xd4/0xe0
>> [   14.203804]  arch_call_rest_init+0xc/0x14
>> [   14.205702]  start_kernel+0x3d8/0x404
>> [   14.207449] ---[ end trace 65449acd5c054609 ]---
>> [   14.209630] sky2 :00:01.0 eth0: tx timeout
>> [   14.211655] sky2 :00:01.0 eth0: transmit ring 0 .. 3 report=0 done=0
>> [   17.906956] sky2 :00:01.0 eth0: Link is up at 1000 Mbps, full duplex, 
>> flow control both
> 
> I am stuck at the point where the network card cannot receive interrupts
> in the guest OS.  So I took time to look into the code and added some
> debug prints to help me understand the detailed flow; below are two main
> questions I am confused by and need some guidance on:
> 
> - The first question is about the MSI usage in the network card driver;
>   when reviewing the sky2 network card driver [1], it has a function
>   sky2_test_msi() which is used to decide if it can use 

Re: Question: KVM: Failed to bind vfio with PCI-e / SMMU on Juno-r2

2019-03-13 Thread Leo Yan
On Wed, Mar 13, 2019 at 04:00:48PM +0800, Leo Yan wrote:

[...]

> - The second question is about GICv2m.  If I understand correctly, when
>   we pass a PCI-e device through to the guest OS, in the guest OS we
>   should create the below data path for the PCI-e device:
>                                                          +--------+
>                                                       -> | Memory |
>   +----------+    +------------------+    +-------+   /  +--------+
>   | Net card | -> | PCI-e controller | -> | IOMMU | -
>   +----------+    +------------------+    +-------+   \  +--------+
>                                                       -> | MSI    |
>                                                          | frame  |
>                                                          +--------+
> 
>   Since the master is now the network card/PCI-e controller and not the
>   CPU, there are no 2 stages for memory accesses (VA->IPA->PA).  In this
>   case, do we configure the IOMMU (SMMU) for the guest OS for address
>   translation before switching from host to guest?  Or does the SMMU
>   also have two-stage memory mapping?
> 
>   Another thing that confuses me is that I can see the MSI frame is
>   mapped to the GIC's physical address in the host OS, so the PCI-e
>   device can send messages correctly to the MSI frame.  But for the
>   guest OS, the MSI frame is mapped to one IPA memory region, and this
>   region is used to emulate the GICv2 MSI frame rather than the hardware
>   MSI frame; so will any access from PCI-e to this region trap to the
>   hypervisor on the CPU side so that the KVM hypervisor can help emulate
>   (and inject) the interrupt for the guest OS?
> 
>   Essentially, I want to check what the expected behaviour is for the
>   GICv2 MSI frame working mode when we want to pass one PCI-e device
>   through to the guest OS and the PCI-e device has one static MSI frame.

From the blog [1], it has the below explanation for my question about
mapping the IOVA and the hardware MSI address.  But I searched for the
flag VFIO_DMA_FLAG_MSI_RESERVED_IOVA, which isn't found in the mainline
kernel; I might be missing something here and want to check whether the
related patches have been merged into the mainline kernel?

'We reuse the VFIO DMA MAP ioctl to pass this reserved IOVA region. A
new flag (VFIO_DMA_FLAG_MSI_RESERVED_IOVA ) is introduced to
differentiate such reserved IOVA from RAM IOVA. Then the base/size of
the window is passed to the IOMMU driver through a new function
introduced in the IOMMU API. 

The IOVA allocation within the supplied reserved IOVA window is
performed on-demand, when the MSI controller composes/writes the MSI
message in the PCIe device. Also the IOMMU mapping between the newly
allocated IOVA and the backdoor address page is done at that time. The
MSI controller uses a new function introduced in the IOMMU API to
allocate the IOVA and create an IOMMU mapping.
 
So there are adaptations needed at VFIO, IOMMU and MSI controller
level. The extension of the IOMMU API still is under discussion. Also
changes at MSI controller level need to be consolidated.'

P.S. I also tried both tools, qemu and kvmtool; neither can deliver the
interrupt for the network card in the guest OS.

Thanks,
Leo Yan

[1] https://www.linaro.org/blog/kvm-pciemsi-passthrough-armarm64/


Re: [RFC] Question about TLB flush while set Stage-2 huge pages

2019-03-13 Thread Zheng Xiang



On 2019/3/13 2:18, Marc Zyngier wrote:
> Hi Zheng,
> 
> On 12/03/2019 15:30, Zheng Xiang wrote:
>> Hi Marc,
>>
>> On 2019/3/12 19:32, Marc Zyngier wrote:
>>> Hi Zheng,
>>>
>>> On 11/03/2019 16:31, Zheng Xiang wrote:
 Hi all,

 While a page is merged into a transparent huge page, KVM will invalidate
 Stage-2 for the base address of the huge page and the whole of Stage-1.
 However, this only invalidates the first page within the huge page; the
 other pages are not invalidated, see below:

 +---+--+
 |abcde   2MB-Page  |
 +---+--+

 TLB before setting new pmd:
 +---+--+
 |  VA   |PAGESIZE  |
 +---+--+
 |  a|  4KB |
 +---+--+
 |  b|  4KB |
 +---+--+
 |  c|  4KB |
 +---+--+
 |  d|  4KB |
 +---+--+

 TLB after setting new pmd:
 +---+--+
 |  VA   |PAGESIZE  |
 +---+--+
 |  a|  2MB |
 +---+--+
 |  b|  4KB |
 +---+--+
 |  c|  4KB |
 +---+--+
 |  d|  4KB |
 +---+--+

 When the VM accesses address *b*, it will hit the TLB and result in TLB
 conflict aborts or other potential exceptions.
>>>
>>> That's really bad. I can only imagine two scenarios:
>>>
>>> 1) We fail to unmap a,b,c,d (and potentially another 508 PTEs), losing
>>> the PTE table in the process, and place the PMD instead. I can't see
>>> this happening.
>>>
>>> 2) We fail to invalidate on unmap, and that slightly less bad (but still
>>> quite bad).
>>>
>>> Which of the two cases are you seeing?
>>>
 For example, we need to keep track of the VM's dirty memory pages when
 the VM is in live migration.
 KVM will set the memslot READONLY and split the huge pages.
 After live migration is canceled and aborted, the pages will be merged
 into THP.
 Later accesses to these READONLY pages will cause level-3 Permission
 Faults until they are invalidated.

 So should we invalidate the TLB entries for all related pages (e.g.
 a, b, c, d), like __flush_tlb_range()?
 Or we can call __kvm_tlb_flush_vmid() to invalidate all TLB entries.
>>>
>>> We should perform an invalidate on each unmap. unmap_stage2_range seems
>>> to do the right thing. __flush_tlb_range only caters for Stage1
>>> mappings, and __kvm_tlb_flush_vmid() is too big a hammer, as it nukes
>>> TLBs for the whole VM.
>>>
>>> I'd really like to understand what you're seeing, and how to reproduce
>>> it. Do you have a minimal example I could run on my own HW?
>>
>> When I start the live migration for a VM, qemu then begins to log and
>> count dirty pages.
>> During the live migration, KVM sets the pages READONLY so that we can
>> count how many pages would be written afterwards.
>>
>> Everything is OK until I cancel the live migration and qemu stops logging.
>> Later the VM hangs.
>> The trace log repeatedly shows a level-3 permission fault caused by a
>> write to the same IPA. After analyzing the source code, I find KVM always
>> returns from the below *if* statement in stage2_set_pmd_huge() even if we
>> only have a single VCPU:
>>
>> /*
>>  * Multiple vcpus faulting on the same PMD entry, can
>>  * lead to them sequentially updating the PMD with the
>>  * same value. Following the break-before-make
>>  * (pmd_clear() followed by tlb_flush()) process can
>>  * hinder forward progress due to refaults generated
>>  * on missing translations.
>>  *
>>  * Skip updating the page table if the entry is
>>  * unchanged.
>>  */
>> if (pmd_val(old_pmd) == pmd_val(*new_pmd))
>> return 0;
>>
>> The PMD already has the PMD_S2_RDWR bit set. I suspect
>> kvm_tlb_flush_vmid_ipa() does not invalidate Stage-2 for the subpages of
>> the PMD (except the first PTE of this PMD). Finally I added some debug
>> code to flush the TLB for all subpages of the PMD, as shown below:
>>
>> /*
>>  * Mapping in huge pages should only happen through a
>>  * fault.  If a page is merged into a transparent huge
>>  * page, the individual subpages of that huge page
>>  * should be unmapped through MMU notifiers before we
>>  * get here.
>>  *
>>  * Merging of CompoundPages is not supported; they
>>  * should become 
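
(The quoted message is truncated here. For context, the per-subpage debug
flush being described might look roughly like the sketch below -- a
hypothetical reconstruction in kernel context, not the author's exact
code; addr, pmd and kvm come from the surrounding stage2_set_pmd_huge():)

/* Invalidate every 4K subpage of the 2MB range, not just the base IPA. */
phys_addr_t ipa = addr & PMD_MASK;
phys_addr_t end = ipa + PMD_SIZE;

pmd_clear(pmd);
for (; ipa < end; ipa += PAGE_SIZE)
        kvm_tlb_flush_vmid_ipa(kvm, ipa);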

Re: Question: KVM: Failed to bind vfio with PCI-e / SMMU on Juno-r2

2019-03-13 Thread Leo Yan
Hi Eric & all,

On Mon, Mar 11, 2019 at 10:35:01PM +0800, Leo Yan wrote:

[...]

> So now I made some progress and can see the networking card is
> passed through to the guest OS, though the networking card reports
> errors now.  Below are the detailed steps and info:
> 
> - Bind devices in the same IOMMU group to vfio driver:
> 
>   echo 0000:03:00.0 > /sys/bus/pci/devices/0000\:03\:00.0/driver/unbind
>   echo 1095 3132 > /sys/bus/pci/drivers/vfio-pci/new_id
> 
>   echo 0000:08:00.0 > /sys/bus/pci/devices/0000\:08\:00.0/driver/unbind
>   echo 11ab 4380 > /sys/bus/pci/drivers/vfio-pci/new_id
> 
> - Enable 'allow_unsafe_interrupts=1' for module vfio_iommu_type1;
> 
> - Use qemu to launch guest OS:
> 
>   qemu-system-aarch64 \
> -cpu host -M virt,accel=kvm -m 4096 -nographic \
> -kernel /root/virt/Image -append root=/dev/vda2 \
> -net none -device vfio-pci,host=08:00.0 \
> -drive if=virtio,file=/root/virt/qemu/debian.img \
> -append 'loglevel=8 root=/dev/vda2 rw console=ttyAMA0 earlyprintk 
> ip=dhcp'
> 
> - Host log:
> 
> [  188.329861] vfio-pci 0000:08:00.0: enabling device (0000 -> 0003)
> 
> - Below is the guest log; from the log, the driver has been registered
>   but it reports a PCI hardware failure and a timeout for the interrupt.
> 
>   So is this caused by very 'slow' interrupt forwarding?  The Juno
>   board uses GICv2 (I think it has the GICv2m extension).
> 
> [...]
> 
> [1.024483] sky2 :00:01.0 eth0: enabling interface
> [1.026822] sky2 :00:01.0: error interrupt status=0x8000
> [1.029155] sky2 :00:01.0: PCI hardware error (0x1010)
> [4.000699] sky2 :00:01.0 eth0: Link is up at 1000 Mbps, full duplex, 
> flow control both
> [4.026116] Sending DHCP requests .
> [4.026201] sky2 :00:01.0: error interrupt status=0x8000
> [4.030043] sky2 :00:01.0: PCI hardware error (0x1010)
> [6.546111] ..
> [   14.118106] [ cut here ]
> [   14.120672] NETDEV WATCHDOG: eth0 (sky2): transmit queue 0 timed out
> [   14.123555] WARNING: CPU: 0 PID: 0 at net/sched/sch_generic.c:461 
> dev_watchdog+0x2b4/0x2c0
> [   14.127082] Modules linked in:
> [   14.128631] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 
> 5.0.0-rc8-00061-ga98f9a047756-dirty #
> [   14.132800] Hardware name: linux,dummy-virt (DT)
> [   14.135082] pstate: 6005 (nZCv daif -PAN -UAO)
> [   14.137459] pc : dev_watchdog+0x2b4/0x2c0
> [   14.139457] lr : dev_watchdog+0x2b4/0x2c0
> [   14.141351] sp : 10003d70
> [   14.142924] x29: 10003d70 x28: 112f60c0
> [   14.145433] x27: 0140 x26: 8000fa6eb3b8
> [   14.147936] x25:  x24: 8000fa7a7c80
> [   14.150428] x23: 8000fa6eb39c x22: 8000fa6eafb8
> [   14.152934] x21: 8000fa6eb000 x20: 112f7000
> [   14.155437] x19:  x18: 
> [   14.157929] x17:  x16: 
> [   14.160432] x15: 112fd6c8 x14: 90003a97
> [   14.162927] x13: 10003aa5 x12: 11315878
> [   14.165428] x11: 11315000 x10: 05f5e0ff
> [   14.167935] x9 : ffd0 x8 : 64656d6974203020
> [   14.170430] x7 : 6575657571207469 x6 : 00e3
> [   14.172935] x5 :  x4 : 
> [   14.175443] x3 :  x2 : 113158a8
> [   14.177938] x1 : f2db9128b1f08600 x0 : 
> [   14.180443] Call trace:
> [   14.181625]  dev_watchdog+0x2b4/0x2c0
> [   14.183377]  call_timer_fn+0x20/0x78
> [   14.185078]  expire_timers+0xa4/0xb0
> [   14.186777]  run_timer_softirq+0xa0/0x190
> [   14.188687]  __do_softirq+0x108/0x234
> [   14.190428]  irq_exit+0xcc/0xd8
> [   14.191941]  __handle_domain_irq+0x60/0xb8
> [   14.193877]  gic_handle_irq+0x58/0xb0
> [   14.195630]  el1_irq+0xb0/0x128
> [   14.197132]  arch_cpu_idle+0x10/0x18
> [   14.198835]  do_idle+0x1cc/0x288
> [   14.200389]  cpu_startup_entry+0x24/0x28
> [   14.202251]  rest_init+0xd4/0xe0
> [   14.203804]  arch_call_rest_init+0xc/0x14
> [   14.205702]  start_kernel+0x3d8/0x404
> [   14.207449] ---[ end trace 65449acd5c054609 ]---
> [   14.209630] sky2 :00:01.0 eth0: tx timeout
> [   14.211655] sky2 :00:01.0 eth0: transmit ring 0 .. 3 report=0 done=0
> [   17.906956] sky2 :00:01.0 eth0: Link is up at 1000 Mbps, full duplex, 
> flow control both

I am stuck at the point where the network card cannot receive interrupts
in the guest OS.  So I took time to look into the code and added some
debug prints to help me understand the detailed flow; below are two main
questions I am confused by and need some guidance on:

- The first question is about the MSI usage in the network card driver;
  when reviewing the sky2 network card driver [1], it has a function
  sky2_test_msi() which is used to decide whether it can use MSI or not.

  The interesting thing is that this function will first request an irq
  for the interrupt and, based on the interrupt handler, read back a
  register and then
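
(The archived message is truncated here. For context, the MSI self-test
pattern being described is roughly the following sketch, in kernel
context; this is not the exact sky2 code, and trigger_test_irq() is a
hypothetical device poke:)

static bool msi_seen;

static irqreturn_t test_msi_handler(int irq, void *dev_id)
{
        msi_seen = true; /* delivery works if we ever get here */
        return IRQ_HANDLED;
}

static bool probe_msi(struct pci_dev *pdev)
{
        bool ok = false;

        if (pci_enable_msi(pdev))
                return false;
        if (!request_irq(pdev->irq, test_msi_handler, 0, "msi-test", pdev)) {
                trigger_test_irq(pdev); /* hypothetical register poke */
                msleep(10);             /* give the interrupt time to land */
                ok = msi_seen;
                free_irq(pdev->irq, pdev);
        }
        if (!ok)
                pci_disable_msi(pdev);  /* fall back to INTx */
        return ok;
}

(This also illustrates why the test fails in a guest without interrupt
forwarding: the handler never runs, msi_seen stays false, and the driver
falls back to INTx.)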