Re: [PATCH v5 0/1] Implements MMIO emulation for lvx/stvx instructions
On Fri, Feb 02, 2018 at 11:30:18AM +1100, Paul Mackerras wrote: > On Thu, Feb 01, 2018 at 04:15:38PM -0200, Jose Ricardo Ziviani wrote: > > v5: > > - Fixed the mask off of the effective address > > > > v4: > > - Changed KVM_MMIO_REG_VMX to 0xc0 because there are 64 VSX registers > > > > v3: > > - Added Reported-by in the commit message > > > > v2: > > - kvmppc_get_vsr_word_offset() moved back to its original place > > - EA AND ~0xF, following ISA. > > - fixed BE/LE cases > > > > TESTS: > > > > For testing purposes I wrote a small program that performs stvx/lvx using > > the > > program's virtual memory and using MMIO. Load/Store into virtual memory is > > the > > model I use to check if MMIO results are correct (because only MMIO is > > emulated > > by KVM). > > I'd be interested to see your test program because in my testing it's > still not right, unfortunately. Interestingly, it is right for the BE > guest on LE host case. However, with a LE guest on a LE host the two > halves are swapped, both for lvx and stvx: Absolutely, here it's: https://gist.github.com/jrziviani/a65e71c5d661bffa8afcd6710fedd520 It basically maps an IO region and also allocates some memory from the program's address space. Then I store to/load from both addresses and compare the results. Because only the mmio load/store are emulated, I use the regular load/store as a model. > > error in lvx at byte 0 > was: -> 62 69 70 77 7e 85 8c 93 2a 31 38 3f 46 4d 54 5b > ref: -> 2a 31 38 3f 46 4d 54 5b 62 69 70 77 7e 85 8c 93 > error in stvx at byte 0 > was: -> 49 50 57 5e 65 6c 73 7a 11 18 1f 26 2d 34 3b 42 > ref: -> 11 18 1f 26 2d 34 3b 42 49 50 57 5e 65 6c 73 7a > > The byte order within each 8-byte half is correct but the two halves > are swapped. ("was" is what was in memory and "ref" is the correct > value. For lvx it does lvx from emulated MMIO and stvx to ordinary > memory, and for stvx it does lvx from ordinary memory and stvx to > emulated MMIO. In both cases the checking is done with a byte by byte > comparison.) The funny thing is that I still see it right in both cases, so I believe that my test case is incorrect. Example (host LE, guest LE): > VR0 after lvx (gdb) p $vr0 {uint128 = 0x1234567887654321, v4_float = { -1.72477726e-34, -3.03283305e-13, 1.46555735e+13, 5.69045661e-28}, v4_int32 = {-2023406815, -1431651397, 1431651396, 305419896}, v8_int16 = { 17185, -30875, -17477, -21846, 17476, 21845, 22136, 4660}, v16_int8 = {33, 67, 101, -121, -69, -69, -86, -86, 68, 68, 85, 85, 120, 86, 52, 18}} address: 0x10030010 0x1234567887654321 > VR0 after lvx from MMIO (gdb) p $vr0 $3 = {uint128 = 0x1234567887654321, v4_float = { -1.72477726e-34, -3.03283305e-13, 1.46555735e+13, 5.69045661e-28}, v4_int32 = {-2023406815, -1431651397, 1431651396, 305419896}, v8_int16 = { 17185, -30875, -17477, -21846, 17476, 21845, 22136, 4660}, v16_int8 = {33, 67, 101, -121, -69, -69, -86, -86, 68, 68, 85, 85, 120, 86, 52, 18}} io_address: 0x3fffb7f7 0x1234567887654321 I only see it wrong when I mess with copy order: if (vcpu->arch.mmio_vmx_copy_nums == /*from 1 to */ 2) { VCPU_VSX_VR(vcpu, index).u[kvmppc_get_vsr_word_offset(2)] = lo; VCPU_VSX_VR(vcpu, index).u[kvmppc_get_vsr_word_offset(3)] = hi; } else if (vcpu->arch.mmio_vmx_copy_nums == /*from 2 to */ 1) { VCPU_VSX_VR(vcpu, index).u[kvmppc_get_vsr_word_offset(0)] = lo; VCPU_VSX_VR(vcpu, index).u[kvmppc_get_vsr_word_offset(1)] = hi; } then I get: address: 0x1003b530010 0x1234567887654321 io_address: 0x3fff811a 0x8765432112345678 Anyway, your suggestion works great and it's a way more elegant/easy to understand. I'll send the next version with it. Thank you very much for your patience and hints that helped me to understand KVM better. :-) > > Paul. >
Re: [PATCH RESEND 1/1] KVM: PPC: Book3S: Add MMIO emulation for VMX instructions
On Thu, Jan 11, 2018 at 09:36:58AM +1100, Paul Mackerras wrote: > On Mon, Jan 08, 2018 at 04:29:31PM -0200, Jose Ricardo Ziviani wrote: > > This patch provides the MMIO load/store vector indexed > > X-Form emulation. > > > > Instructions implemented: lvx, stvx > > > > Signed-off-by: Jose Ricardo Ziviani> > What testing has been done of this patch? In particular, what > combinations of endianness of host and guest have been tested? For a > cross-endian situation (endianness of guest different from host) you > would need to load the two halves of the VMX register image in > reversed order (note that lvx/stvx are different from lxvd2x/stxvd2x > and lxvw4x/stxvw4x in this respect), and I don't see that being done. Hello! Thank you for reviewing it. Yes, I tested it with both in little endian mode. You're absolutely right, I'll review the cross-endian scenario that's not covered indeed. > > Also, lvx/stvx mask off the bottom 4 bits of the EA, and I don't see > that being done. > > [snip] > > > diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c > > index 1915e86cef6f..7d0f60359ee0 100644 > > --- a/arch/powerpc/kvm/powerpc.c > > +++ b/arch/powerpc/kvm/powerpc.c > > @@ -832,23 +832,7 @@ void kvm_arch_irq_bypass_del_producer(struct > > irq_bypass_consumer *cons, > > kvm->arch.kvm_ops->irq_bypass_del_producer(cons, prod); > > } > > > > -#ifdef CONFIG_VSX > > -static inline int kvmppc_get_vsr_dword_offset(int index) > > -{ > > - int offset; > > - > > - if ((index != 0) && (index != 1)) > > - return -1; > > - > > -#ifdef __BIG_ENDIAN > > - offset = index; > > -#else > > - offset = 1 - index; > > -#endif > > - > > - return offset; > > -} > > - > > Why does this function need to be moved down in the file? The main reason is that I need it too but I need it available with ALTIVEC(1) support, so I changed the guard to ALTIVEC and moved it down closer to other related functions under the same guard. But I can move it back if you prefer. (1)AFAIK it's possible to have ALTIVEC enabled but VSX disabled, not the other way around. > > > +#ifdef CONFIG_ALTIVEC > > static inline int kvmppc_get_vsr_word_offset(int index) > > { > > int offset; > > @@ -864,6 +848,40 @@ static inline int kvmppc_get_vsr_word_offset(int index) > > return offset; > > } > > > > +static inline void kvmppc_set_vmx_dword(struct kvm_vcpu *vcpu, > > + u64 gpr) > > +{ > > + int index = vcpu->arch.io_gpr & KVM_MMIO_REG_MASK; > > + u64 hi = gpr >> 32; > > + u64 lo = gpr & 0x; > > + > > + if (vcpu->arch.mmio_vmx_copy_nums == 1) { > > + VCPU_VSX_VR(vcpu, index).u[kvmppc_get_vsr_word_offset(2)] = hi; > > + VCPU_VSX_VR(vcpu, index).u[kvmppc_get_vsr_word_offset(3)] = lo; > > + } else if (vcpu->arch.mmio_vmx_copy_nums == 2) { > > + VCPU_VSX_VR(vcpu, index).u[kvmppc_get_vsr_word_offset(0)] = hi; > > + VCPU_VSX_VR(vcpu, index).u[kvmppc_get_vsr_word_offset(1)] = lo; > > + } > > This splitting of the value into hi/lo words looks to me like you're > assuming a big-endian byte ordering. It looks like it will end up > swapping the words in each half of the register on a little-endian > host (regardless of the endianness of the guest). Yep! I focused on VCPU_VSX_VR code to fill the vector correctly but missed the 64-bit value splitting. I'll fix it and improve the testing, considering other endian scenario. > > > +} > > +#endif /* CONFIG_ALTIVEC */ > > + > > +#ifdef CONFIG_VSX > > +static inline int kvmppc_get_vsr_dword_offset(int index) > > +{ > > + int offset; > > + > > + if ((index != 0) && (index != 1)) > > + return -1; > > + > > +#ifdef __BIG_ENDIAN > > + offset = index; > > +#else > > + offset = 1 - index; > > +#endif > > + > > + return offset; > > +} > > + > > static inline void kvmppc_set_vsr_dword(struct kvm_vcpu *vcpu, > > u64 gpr) > > { > > @@ -1027,6 +1045,11 @@ static void kvmppc_complete_mmio_load(struct > > kvm_vcpu *vcpu, > > KVMPPC_VSX_COPY_DWORD_LOAD_DUMP) > > kvmppc_set_vsr_dword_dump(vcpu, gpr); > > break; > > +#endif > > +#ifdef CONFIG_ALTIVEC > > + case KVM_MMIO_REG_VMX: > > + kvmppc_set_vmx_dword(vcpu, gpr); > > + break; > > #endif > > default: > > BUG(); > > @@ -1307,6 +1330,99 @@ static int kvmppc_emulate_mmio_vsx_loadstore(struct > > kvm_vcpu *vcpu, > > } > > #endif /* CONFIG_VSX */ > > > > +#ifdef CONFIG_ALTIVEC > > +/* handle quadword load access in two halves */ > > +int kvmppc_handle_load128_by2x64(struct kvm_run *run, struct kvm_vcpu > > *vcpu, > > + unsigned int rt, int is_default_endian) > > +{ > > + enum emulation_result emulated = EMULATE_DONE; > > + > > + while (vcpu->arch.mmio_vmx_copy_nums) { > > + emulated = __kvmppc_handle_load(run, vcpu, rt, 8, > > + is_default_endian,
Re: [PATCH] KVM: PPC: Book3S: Add capabilities for Meltdown/Spectre workarounds
On Tue, Jan 09, 2018 at 12:57:26PM +0100, Michal Suchánek wrote: > On Tue, 09 Jan 2018 19:39:14 +1100 > Suraj Jitindar Singhwrote: > > > > > s/behavior/behaviour > > Nope. Either is valid and the shorter American spelling is actually > more common. > > If you must nickpick choose something actually broken :p Suraj is right: struct h_cpu_char_result { u64 character; u64 behaviour; }; > > Thanks > > Michal >
Re: [PATCH v2 1/1] powerpc/pseries: increase pseries_dlpar_init initcall priority
On Tue, Jan 02, 2018 at 10:46:07PM +1100, Michael Ellerman wrote: > Jose Ricardo Zivianiwrites: > > > The hotplug engine uses its own workqueue to handle IRQ requests, the > > problem is that such workqueue is initialized after init_ras_IRQ, which > > will cause a kernel panic if any hotplug interruption is issued in that > > period of time. > > > > This patch changes the dlpar initcall registration to make sure it will > > be initialized before init_ras_IRQ. > > Sorry I know this is already v2, but I don't think this is the best fix. > > There's a dependency between the registration of the IRQ in the RAS > code, and the creation of the work queue in the DLPAR code, but it's > currently not explicit. That's the bug. So it'd be better to just make > it explicit. > > As a bonus we can add actual error checking of the workqueue allocation. > > Something like below, can you test it please? sure, no problem. I'll do it. Thanks for reviewing it!!! > > cheers > > diff --git a/arch/powerpc/platforms/pseries/dlpar.c > b/arch/powerpc/platforms/pseries/dlpar.c > index 6e35780c5962..dd8b29e58a98 100644 > --- a/arch/powerpc/platforms/pseries/dlpar.c > +++ b/arch/powerpc/platforms/pseries/dlpar.c > @@ -574,11 +574,26 @@ static ssize_t dlpar_show(struct class *class, struct > class_attribute *attr, > > static CLASS_ATTR_RW(dlpar); > > -static int __init pseries_dlpar_init(void) > +int __init dlpar_workqueue_init(void) > { > + if (pseries_hp_wq) > + return 0; > + > pseries_hp_wq = alloc_workqueue("pseries hotplug workqueue", > WQ_UNBOUND, 1); > + > + return pseries_hp_wq ? 0 : -ENOMEM; > +} > + > +static int __init dlpar_sysfs_init(void) > +{ > + int rc; > + > + rc = dlpar_workqueue_init(); > + if (rc) > + return rc; > + > return sysfs_create_file(kernel_kobj, _attr_dlpar.attr); > } > -machine_device_initcall(pseries, pseries_dlpar_init); > +machine_device_initcall(pseries, dlpar_sysfs_init); > > diff --git a/arch/powerpc/platforms/pseries/pseries.h > b/arch/powerpc/platforms/pseries/pseries.h > index 4470a3194311..1ae1d9f4dbe9 100644 > --- a/arch/powerpc/platforms/pseries/pseries.h > +++ b/arch/powerpc/platforms/pseries/pseries.h > @@ -98,4 +98,6 @@ static inline unsigned long cmo_get_page_size(void) > return CMO_PageSize; > } > > +int dlpar_workqueue_init(void); > + > #endif /* _PSERIES_PSERIES_H */ > diff --git a/arch/powerpc/platforms/pseries/ras.c > b/arch/powerpc/platforms/pseries/ras.c > index 4923ffe230cf..879a92327010 100644 > --- a/arch/powerpc/platforms/pseries/ras.c > +++ b/arch/powerpc/platforms/pseries/ras.c > @@ -69,8 +69,9 @@ static int __init init_ras_IRQ(void) > /* Hotplug Events */ > np = of_find_node_by_path("/event-sources/hot-plug-events"); > if (np != NULL) { > - request_event_sources_irqs(np, ras_hotplug_interrupt, > -"RAS_HOTPLUG"); > + if (dlpar_workqueue_init() == 0) > + request_event_sources_irqs(np, ras_hotplug_interrupt, > + "RAS_HOTPLUG"); > of_node_put(np); > } > >
Re: [PATCH 1/1] powerpc/pseries: Use the system workqueue as fallback to hotplug workqueue
On Fri, Dec 22, 2017 at 11:54:10AM +1100, David Gibson wrote: > On Thu, Dec 21, 2017 at 01:44:48PM -0200, Jose Ricardo Ziviani wrote: > > The hotplug engine uses its own workqueue to handle IRQ requests, the > > problem is that such workqueue is initialized not so early in the boot > > process. > > > > Thus, when the kernel is ready to handle IRQ requests, after the system > > workqueue is initialized, we have a timeframe where any hotplug issued > > by the client will result in a kernel panic. That timeframe goes until > > the hotplug workqueue is initialized. > > > > It would be good to have the hotplug workqueue initialized as soon as > > the system workqueue but I don't think it is possible. So, this patch > > uses the system workqueue as a fallback the handle such IRQs. > > > > Signed-off-by: Jose Ricardo Ziviani> > I don't think this is the right approach. > > It seems to me the bug is that the hotplug interrupt is registered in > init_ras_IRQ(), before the work queue is initialized in > pseries_dlpar_init(). We need to correct that ordering. > Oh, this makes much more sense. I'm going to make some tests and send a v2 then. Thanks for reviewing it! > > --- > > arch/powerpc/platforms/pseries/dlpar.c | 10 +- > > 1 file changed, 9 insertions(+), 1 deletion(-) > > > > diff --git a/arch/powerpc/platforms/pseries/dlpar.c > > b/arch/powerpc/platforms/pseries/dlpar.c > > index 6e35780c5962..0474aa14b5f6 100644 > > --- a/arch/powerpc/platforms/pseries/dlpar.c > > +++ b/arch/powerpc/platforms/pseries/dlpar.c > > @@ -399,7 +399,15 @@ void queue_hotplug_event(struct pseries_hp_errorlog > > *hp_errlog, > > work->errlog = hp_errlog_copy; > > work->hp_completion = hotplug_done; > > work->rc = rc; > > - queue_work(pseries_hp_wq, (struct work_struct *)work); > > + > > + /* The hotplug workqueue may happen to be NULL at the moment > > +* this code is executed, during the boot phase. So, in this > > +* scenario, we can fallback to the system workqueue. > > +*/ > > + if (unlikely(pseries_hp_wq == NULL)) > > + schedule_work((struct work_struct *)work); > > + else > > + queue_work(pseries_hp_wq, (struct work_struct *)work); > > } else { > > *rc = -ENOMEM; > > kfree(hp_errlog_copy); > > -- > David Gibson | I'll have my music baroque, and my code > david AT gibson.dropbear.id.au| minimalist, thank you. NOT _the_ > _other_ > | _way_ _around_! > http://www.ozlabs.org/~dgibson
Re: [PATCH] powerpc/eeh: Disable EEH stack dump by default
On Wed, Sep 20, 2017 at 02:47:08PM +1000, Michael Ellerman wrote: > Jose Ricardo Zivianiwrites: > > > Today, each EEH causes a stack dump to be printed in the logs. In > > production environment it's not quite necessary. Thus, this patch > > I'm unconvinced. A production environment is exactly where you don't > want to be getting an EEH, and so if you *do* then every bit of > information is helpful. > > > For example, instead of the following: > > > > [ 131.778661] EEH: Frozen PHB#2-PE#fd detected > > [ 131.778672] EEH: PE location: N/A, PHB location: N/A > > [ 131.778677] CPU: 21 PID: 10098 Comm: lspci Not tainted ... > > [ 131.778680] Call Trace: > > [ 131.778686] [c003a140bab0] [c0beb58c] dump_stack+... > > > > [ 131.778770] EEH: Detected PCI bus error on PHB#2-PE#fd > > [ 131.778775] EEH: This PCI device has failed 1 times in the last hour > > ... > > > > we will have this by default: > > > > [12777.175880] EEH: Frozen PHB#2-PE#fd detected > > [12777.175893] EEH: PE location: N/A, PHB location: N/A > > [12777.175922] EEH: Detected PCI bus error on PHB#2-PE#fd > > [12777.175931] EEH: This PCI device has failed 2 times in the last hour > > *What* PCI device? > > How am I supposed to know what device/driver just failed? If I had the > stack trace I could probably at least work it out based on the driver > involved. > > cheers > Thank you guys! More people told me it's important to keep it as is. Please, disregard this patch.
Re: [PATCH RFC] powerpc: Implements MMIO emulation for lvx/stvx instructions
On Wed, Aug 30, 2017 at 07:45:17PM +1000, Paul Mackerras wrote: > On Tue, Aug 29, 2017 at 07:18:01PM -0300, Jose Ricardo Ziviani wrote: > > Hello! > > > > This patch implements MMIO emulation for two instructions: lvx and stvx. I > > started to implement other instructions but I'd like to have this reviewed > > beforehand because this is my first patch here and I'll certainly have some > > rework/fixes :-). > > > > Note: stvx is only storing 8 bytes, for some reason the code > > "vcpu->arch.paddr_accessed += run->mmio.len;", which adds the 8-byte offset > > after the first write is not making any difference (interesting that it > > works for load operations). I'm still investigating it but any idea about > > it will be appreciated. > > The run structure is mmapped by userspace (i.e. QEMU) and can be > written by userspace between the first and the second exits to > userspace (you have to do two exits to userspace because you can only > transfer 8 bytes on each exit). It's possible that userspace might be > clearing run->mmio.len. In general it's better not to rely on > anything in *run (except of course the mmio_data for a MMIO read) when > we come in from userspace to the kernel. > > Paul. > Hello Paul, My bad, actually it works. I was mmap'ping an address that doesn't allow 16-byte writing access. After mmap'ping a higher address (of the same device) I was able to perform 16-byte read/write. == before stvx == (gdb) info registers vr0 vr0 {uint128 = 0x12345678abcdef09, ...} (gdb) info registers r9 r9 0x3fffb7c90010 (gdb) x /4wx 0x3fffb7c90010 0x3fffb7c90010: 0x 0x 0x 0x (gdb) info registers r28 r28 0x0 stvxv0,r28,r9 == after stvx == (gdb) x /4wx 0x3fffb7c90010 0x3fffb7c90010: 0x12345678 0x 0xabcdef09 0x == before lvx == (gdb) info registers vr10 vr10 {uint128 = 0x,...} lvx v10,r28,r9 == after lvx == (gdb) info registers vr10 vr10 {uint128 = 0x12345678abcdef09,...} If you think it's ok I'll submit this patch without the RFC. Thank you very much! Ziviani
Re: [PATCH v2] powerpc/mm: Check for _PAGE_PTE in *_devmap()
On Fri, Jul 28, 2017 at 01:35:53AM +1000, Oliver O'Halloran wrote: > The ISA radix translation tree contains two different types of entry, > directories and leaves. The formats of the two entries are different > with the directory entries containing no spare bits for use by software. > As a result we need to ensure that the *_devmap() family of functions > check fail for everything except leaf (PTE) entries. > > Signed-off-by: Oliver O'Halloran> --- > "i'll just tweak the mbox before i sent it, what's the worst that can happen" > *completely breaks KVM* > "..." > --- > arch/powerpc/include/asm/book3s/64/pgtable.h | 4 +++- > 1 file changed, 3 insertions(+), 1 deletion(-) > > diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h > b/arch/powerpc/include/asm/book3s/64/pgtable.h > index d1da415..6bc6248 100644 > --- a/arch/powerpc/include/asm/book3s/64/pgtable.h > +++ b/arch/powerpc/include/asm/book3s/64/pgtable.h > @@ -610,7 +610,9 @@ static inline pte_t pte_mkdevmap(pte_t pte) > > static inline int pte_devmap(pte_t pte) > { > - return !!(pte_raw(pte) & cpu_to_be64(_PAGE_DEVMAP)); > + uint64_t mask = cpu_to_be64(_PAGE_DEVMAP | _PAGE_PTE); > + > + return (pte_raw(pte) & mask) == mask; > } > > static inline pte_t pte_modify(pte_t pte, pgprot_t newprot) > -- > 2.7.4 > Michael should thank you because I was about to send another insane suggestion and ask him if that makes sense. :-D Tested-by: Jose Ricardo Ziviani
Re: KVM guests freeze under upstream kernel
On Thu, Jul 20, 2017 at 10:18:18PM -0300, jos...@linux.vnet.ibm.com wrote: > On Thu, Jul 20, 2017 at 03:21:59PM +1000, Paul Mackerras wrote: > > On Thu, Jul 20, 2017 at 12:02:23AM -0300, jos...@linux.vnet.ibm.com wrote: > > > On Thu, Jul 20, 2017 at 09:42:50AM +1000, Benjamin Herrenschmidt wrote: > > > > On Wed, 2017-07-19 at 16:46 -0300, jos...@linux.vnet.ibm.com wrote: > > > > > Hello! > > > > > > > > > > We're not able to boot any KVM guest using upstream kernel > > > > > (cb8c65ccff7f77d0285f1b126c72d37b2572c865 - 4.13.0-rc1+). > > > > > After reaching the SLOF initial counting, the guest simply freezes: > > > > > > > > Can you send our .config ? > > > > > > Sure, > > > > > > Answering Michael as well: > > > > > > It's a P9 with RHEL kernel 4.11.0-10.el7a.ppc64le installed. The problem > > > was noticed with kernel > 4.13 (I'm currently running 4.13.0-rc1+). > > > > > > QEMU is https://github.com/dgibson/qemu (ppc-for-2.10) but I gave the > > > default packaged Qemu a try. > > > > > > For the guest, I tried both a vanilla Ubuntu 17.04 and the host kernel. > > > But they had never a chance to run since the freezing happened in SLOF. > > > > > > Note that using the 4.11.0-10.el7a.ppc64le kernel it works fine > > > (for any of these Qemu/Guest setup). With 4.13.0-rc1 I have it run after > > > reverting that referred commit. > > > > Is the host kernel running in radix mode? > > yes > > > > > Did you check the host kernel logs for any oops messages? > > dmesg was clean but after sometime waiting (I forgot QEMU running in > another terminal) I got the oops below (after rebooting the host I > couldn't reproduce it again). > > Another test that I did was: > Compile with transparent huge pages disabled: KVM works fine > Compile with transparent huge pages enabled: doesn't work > + disabling it in /sys/kernel/mm/transparent_hugepage: doesn't work > > Just out of my own curiosity I made this small change: > > diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h > b/arch/powerpc/include > index c0737c8..f94a3b6 100644 > --- a/arch/powerpc/include/asm/book3s/64/pgtable.h > +++ b/arch/powerpc/include/asm/book3s/64/pgtable.h > @@ -80,7 +80,7 @@ > > #define _PAGE_SOFT_DIRTY _RPAGE_SW3 /* software: software dirty > tracking >#define _PAGE_SPECIAL _RPAGE_SW2 /* software: special page */ >-#define _PAGE_DEVMAP _RPAGE_SW1 /* software: ZONE_DEVICE page */ >+#define _PAGE_DEVMAP _RPAGE_RSV3 > #define __HAVE_ARCH_PTE_DEVMAP > > and it works. I chose _RPAGE_RSV3 because it uses the same value that > x86 uses (0x0400UL) but I don't if it could have any side > effect > Does this change make any sense to you people? I didn't see any side effect expect that devices backed memory will have a bigger address space in transparent huge pages IF I understand that correctly. If so I can send a patch with this change. Thank you!!
Re: KVM guests freeze under upstream kernel
On Thu, Jul 20, 2017 at 03:21:59PM +1000, Paul Mackerras wrote: > On Thu, Jul 20, 2017 at 12:02:23AM -0300, jos...@linux.vnet.ibm.com wrote: > > On Thu, Jul 20, 2017 at 09:42:50AM +1000, Benjamin Herrenschmidt wrote: > > > On Wed, 2017-07-19 at 16:46 -0300, jos...@linux.vnet.ibm.com wrote: > > > > Hello! > > > > > > > > We're not able to boot any KVM guest using upstream kernel > > > > (cb8c65ccff7f77d0285f1b126c72d37b2572c865 - 4.13.0-rc1+). > > > > After reaching the SLOF initial counting, the guest simply freezes: > > > > > > Can you send our .config ? > > > > Sure, > > > > Answering Michael as well: > > > > It's a P9 with RHEL kernel 4.11.0-10.el7a.ppc64le installed. The problem > > was noticed with kernel > 4.13 (I'm currently running 4.13.0-rc1+). > > > > QEMU is https://github.com/dgibson/qemu (ppc-for-2.10) but I gave the > > default packaged Qemu a try. > > > > For the guest, I tried both a vanilla Ubuntu 17.04 and the host kernel. > > But they had never a chance to run since the freezing happened in SLOF. > > > > Note that using the 4.11.0-10.el7a.ppc64le kernel it works fine > > (for any of these Qemu/Guest setup). With 4.13.0-rc1 I have it run after > > reverting that referred commit. > > Is the host kernel running in radix mode? yes > > Did you check the host kernel logs for any oops messages? dmesg was clean but after sometime waiting (I forgot QEMU running in another terminal) I got the oops below (after rebooting the host I couldn't reproduce it again). Another test that I did was: Compile with transparent huge pages disabled: KVM works fine Compile with transparent huge pages enabled: doesn't work + disabling it in /sys/kernel/mm/transparent_hugepage: doesn't work Just out of my own curiosity I made this small change: diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h b/arch/powerpc/include index c0737c8..f94a3b6 100644 --- a/arch/powerpc/include/asm/book3s/64/pgtable.h +++ b/arch/powerpc/include/asm/book3s/64/pgtable.h @@ -80,7 +80,7 @@ #define _PAGE_SOFT_DIRTY _RPAGE_SW3 /* software: software dirty tracking #define _PAGE_SPECIAL _RPAGE_SW2 /* software: special page */ -#define _PAGE_DEVMAP _RPAGE_SW1 /* software: ZONE_DEVICE page */ +#define _PAGE_DEVMAP _RPAGE_RSV3 #define __HAVE_ARCH_PTE_DEVMAP and it works. I chose _RPAGE_RSV3 because it uses the same value that x86 uses (0x0400UL) but I don't if it could have any side effect SLOF ** QEMU Starting Build Date = Mar 3 2017 13:29:19 FW Version = git-66d250ef0fd06bb8 Press "s" to enter Open Firmware. [ 105.604333] Unable to handle kernel paging request for data at address 0x [ 105.604448] Faulting instruction address: 0xc0910b28 [ 105.604526] Oops: Kernel access of bad area, sig: 11 [#1] [ 105.604585] SMP NR_CPUS=2048 [ 105.604588] NUMA [ 105.604633] PowerNV [ 105.604697] Modules linked in: xt_CHECKSUM ipt_MASQUERADE nf_nat_masquerade_ipv4 tun ip6t_rpfilter ipt_REJECT nf_reject_ipv4 ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter kvm_hv kvm i2c_dev at24 ghash_generic ses enclosure gf128mul scsi_transport_sas xts sg ctr ipmi_powernv ipmi_devintf shpchp opal_prd vmx_crypto ipmi_msghandler uio_pdrv_genirq uio ofpart powernv_flash i2c_opal ibmpowernv mtd nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c [ 105.605561] sd_mod ast i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm i40e i2c_core aacraid ptp pps_core dm_mirror dm_region_hash dm_log dm_mod [ 105.605759] CPU: 0 PID: 6 Comm: kworker/u32:0 Not tainted 4.13.0-rc1+ #57 [ 105.605836] Workqueue: netns cleanup_net [ 105.605880] task: c00ff6404200 task.stack: c00ff648c000 [ 105.605947] NIP: c0910b28 LR: c07cd6ec CTR: c07cd5d0 [ 105.606026] REGS: c00ff648f7d0 TRAP: 0300 Not tainted (4.13.0-rc1+) [ 105.606090] MSR: 90009033[ 105.606111] CR: 88002048 XER: 2000 [ 105.606203] CFAR: c07cd6e8 DAR: DSISR: 4000 SOFTE: 1 [ 105.606203] GPR00: c07cd6ec c00ff648fa50 c0f5c600 [ 105.606203] GPR04: c00ff6404cc0 c00ff6404280 782ccd5c cc908fe7 [ 105.606203] GPR08: c00ff648c000 8000 [ 105.606203] GPR12: c07cd5d0 cfb0 c01050f8 c00ffa150ec0
Re: KVM guests freeze under upstream kernel
On Thu, Jul 20, 2017 at 09:42:50AM +1000, Benjamin Herrenschmidt wrote: > On Wed, 2017-07-19 at 16:46 -0300, jos...@linux.vnet.ibm.com wrote: > > Hello! > > > > We're not able to boot any KVM guest using upstream kernel > > (cb8c65ccff7f77d0285f1b126c72d37b2572c865 - 4.13.0-rc1+). > > After reaching the SLOF initial counting, the guest simply freezes: > > Can you send our .config ? Sure, Answering Michael as well: It's a P9 with RHEL kernel 4.11.0-10.el7a.ppc64le installed. The problem was noticed with kernel > 4.13 (I'm currently running 4.13.0-rc1+). QEMU is https://github.com/dgibson/qemu (ppc-for-2.10) but I gave the default packaged Qemu a try. For the guest, I tried both a vanilla Ubuntu 17.04 and the host kernel. But they had never a chance to run since the freezing happened in SLOF. Note that using the 4.11.0-10.el7a.ppc64le kernel it works fine (for any of these Qemu/Guest setup). With 4.13.0-rc1 I have it run after reverting that referred commit. Thanks! (config attached) -- # # Automatically generated file; DO NOT EDIT. # Linux/powerpc 4.13.0-rc1 Kernel Configuration # CONFIG_PPC64=y # # Processor support # CONFIG_PPC_BOOK3S_64=y # CONFIG_PPC_BOOK3E_64 is not set # CONFIG_POWER7_CPU is not set CONFIG_POWER8_CPU=y CONFIG_PPC_BOOK3S=y CONFIG_PPC_FPU=y CONFIG_ALTIVEC=y CONFIG_VSX=y CONFIG_PPC_ICSWX=y # CONFIG_PPC_ICSWX_PID is not set # CONFIG_PPC_ICSWX_USE_SIGILL is not set CONFIG_PPC_STD_MMU=y CONFIG_PPC_STD_MMU_64=y CONFIG_PPC_RADIX_MMU=y CONFIG_ARCH_ENABLE_HUGEPAGE_MIGRATION=y CONFIG_PPC_MM_SLICES=y CONFIG_PPC_HAVE_PMU_SUPPORT=y CONFIG_PPC_PERF_CTRS=y CONFIG_FORCE_SMP=y CONFIG_SMP=y CONFIG_NR_CPUS=2048 CONFIG_PPC_DOORBELL=y # CONFIG_CPU_BIG_ENDIAN is not set CONFIG_CPU_LITTLE_ENDIAN=y CONFIG_PPC64_BOOT_WRAPPER=y CONFIG_64BIT=y CONFIG_ARCH_PHYS_ADDR_T_64BIT=y CONFIG_ARCH_DMA_ADDR_T_64BIT=y CONFIG_MMU=y CONFIG_ARCH_MMAP_RND_BITS_MAX=29 CONFIG_ARCH_MMAP_RND_BITS_MIN=14 CONFIG_ARCH_MMAP_RND_COMPAT_BITS_MAX=13 CONFIG_ARCH_MMAP_RND_COMPAT_BITS_MIN=7 CONFIG_HAVE_SETUP_PER_CPU_AREA=y CONFIG_NEED_PER_CPU_EMBED_FIRST_CHUNK=y CONFIG_NR_IRQS=512 CONFIG_NMI_IPI=y CONFIG_STACKTRACE_SUPPORT=y CONFIG_TRACE_IRQFLAGS_SUPPORT=y CONFIG_LOCKDEP_SUPPORT=y CONFIG_RWSEM_XCHGADD_ALGORITHM=y CONFIG_GENERIC_HWEIGHT=y CONFIG_ARCH_HAS_DMA_SET_COHERENT_MASK=y CONFIG_PPC=y # CONFIG_GENERIC_CSUM is not set CONFIG_EARLY_PRINTK=y CONFIG_PANIC_TIMEOUT=180 CONFIG_COMPAT=y CONFIG_SYSVIPC_COMPAT=y CONFIG_SCHED_OMIT_FRAME_POINTER=y CONFIG_ARCH_MAY_HAVE_PC_FDC=y CONFIG_PPC_UDBG_16550=y # CONFIG_GENERIC_TBSYNC is not set CONFIG_AUDIT_ARCH=y CONFIG_GENERIC_BUG=y CONFIG_EPAPR_BOOT=y # CONFIG_DEFAULT_UIMAGE is not set CONFIG_ARCH_HIBERNATION_POSSIBLE=y CONFIG_ARCH_SUSPEND_POSSIBLE=y # CONFIG_PPC_DCR_NATIVE is not set # CONFIG_PPC_DCR_MMIO is not set # CONFIG_PPC_OF_PLATFORM_PCI is not set CONFIG_ARCH_SUPPORTS_DEBUG_PAGEALLOC=y CONFIG_ARCH_SUPPORTS_UPROBES=y CONFIG_PPC_EMULATE_SSTEP=y CONFIG_ZONE_DMA32=y CONFIG_PGTABLE_LEVELS=4 CONFIG_DEFCONFIG_LIST="/lib/modules/$UNAME_RELEASE/.config" CONFIG_IRQ_WORK=y CONFIG_BUILDTIME_EXTABLE_SORT=y # # General setup # CONFIG_INIT_ENV_ARG_LIMIT=32 CONFIG_CROSS_COMPILE="" # CONFIG_COMPILE_TEST is not set CONFIG_LOCALVERSION="" # CONFIG_LOCALVERSION_AUTO is not set CONFIG_HAVE_KERNEL_GZIP=y CONFIG_HAVE_KERNEL_XZ=y CONFIG_KERNEL_GZIP=y # CONFIG_KERNEL_XZ is not set CONFIG_DEFAULT_HOSTNAME="(none)" CONFIG_SWAP=y CONFIG_SYSVIPC=y CONFIG_SYSVIPC_SYSCTL=y CONFIG_POSIX_MQUEUE=y CONFIG_POSIX_MQUEUE_SYSCTL=y CONFIG_CROSS_MEMORY_ATTACH=y CONFIG_FHANDLE=y # CONFIG_USELIB is not set CONFIG_AUDIT=y CONFIG_HAVE_ARCH_AUDITSYSCALL=y CONFIG_AUDITSYSCALL=y CONFIG_AUDIT_WATCH=y CONFIG_AUDIT_TREE=y # # IRQ subsystem # CONFIG_GENERIC_IRQ_SHOW=y CONFIG_GENERIC_IRQ_SHOW_LEVEL=y CONFIG_GENERIC_IRQ_MIGRATION=y CONFIG_HARDIRQS_SW_RESEND=y CONFIG_IRQ_DOMAIN=y CONFIG_GENERIC_MSI_IRQ=y # CONFIG_IRQ_DOMAIN_DEBUG is not set CONFIG_IRQ_FORCED_THREADING=y CONFIG_SPARSE_IRQ=y # CONFIG_GENERIC_IRQ_DEBUGFS is not set CONFIG_GENERIC_TIME_VSYSCALL=y CONFIG_GENERIC_CLOCKEVENTS=y CONFIG_ARCH_HAS_TICK_BROADCAST=y CONFIG_GENERIC_CLOCKEVENTS_BROADCAST=y CONFIG_GENERIC_CMOS_UPDATE=y # # Timers subsystem # CONFIG_TICK_ONESHOT=y CONFIG_NO_HZ_COMMON=y # CONFIG_HZ_PERIODIC is not set # CONFIG_NO_HZ_IDLE is not set CONFIG_NO_HZ_FULL=y # CONFIG_NO_HZ_FULL_ALL is not set CONFIG_NO_HZ=y CONFIG_HIGH_RES_TIMERS=y # # CPU/Task time and stats accounting # CONFIG_VIRT_CPU_ACCOUNTING=y CONFIG_VIRT_CPU_ACCOUNTING_GEN=y # CONFIG_IRQ_TIME_ACCOUNTING is not set CONFIG_BSD_PROCESS_ACCT=y CONFIG_BSD_PROCESS_ACCT_V3=y CONFIG_TASKSTATS=y CONFIG_TASK_DELAY_ACCT=y CONFIG_TASK_XACCT=y CONFIG_TASK_IO_ACCOUNTING=y # # RCU Subsystem # CONFIG_TREE_RCU=y # CONFIG_RCU_EXPERT is not set CONFIG_SRCU=y CONFIG_TREE_SRCU=y # CONFIG_TASKS_RCU is not set CONFIG_RCU_STALL_COMMON=y CONFIG_RCU_NEED_SEGCBLIST=y CONFIG_CONTEXT_TRACKING=y # CONFIG_CONTEXT_TRACKING_FORCE is not set CONFIG_RCU_NOCB_CPU=y # CONFIG_BUILD_BIN2C is not set #
KVM guests freeze under upstream kernel
Hello! We're not able to boot any KVM guest using upstream kernel (cb8c65ccff7f77d0285f1b126c72d37b2572c865 - 4.13.0-rc1+). After reaching the SLOF initial counting, the guest simply freezes: SLOF ** QEMU Starting Build Date = Mar 3 2017 13:29:19 FW Version = git-66d250ef0fd06bb8 Press "s" to enter Open Firmware. C0360 After bisecting I found the commit: https://github.com/torvalds/linux/commit/ebd3119 powerpc/mm: Add devmap support for ppc64 Add support for the devmap bit on PTEs and PMDs for PPC64 Book3S. This is used to differentiate device backed memory from transparent huge pages since they are handled in more or less the same manner by the core mm code. Reverting the commit and rebuilding 4.13.0-rc1+ was enough to make a workaround. But I'll need some help from you guys in order to solve it. Thanks! Jose Ziviani