Re: [PATCH v5 0/1] Implements MMIO emulation for lvx/stvx instructions

2018-02-02 Thread joserz
On Fri, Feb 02, 2018 at 11:30:18AM +1100, Paul Mackerras wrote:
> On Thu, Feb 01, 2018 at 04:15:38PM -0200, Jose Ricardo Ziviani wrote:
> > v5:
> >  - Fixed the mask off of the effective address
> > 
> > v4:
> >   - Changed KVM_MMIO_REG_VMX to 0xc0 because there are 64 VSX registers
> > 
> > v3:
> >   - Added Reported-by in the commit message
> > 
> > v2:
> >   - kvmppc_get_vsr_word_offset() moved back to its original place
> >   - EA AND ~0xF, following ISA.
> >   - fixed BE/LE cases
> > 
> > TESTS:
> > 
> > For testing purposes I wrote a small program that performs stvx/lvx using 
> > the
> > program's virtual memory and using MMIO. Load/Store into virtual memory is 
> > the
> > model I use to check if MMIO results are correct (because only MMIO is 
> > emulated
> > by KVM).
> 
> I'd be interested to see your test program because in my testing it's
> still not right, unfortunately.  Interestingly, it is right for the BE
> guest on LE host case.  However, with a LE guest on a LE host the two
> halves are swapped, both for lvx and stvx:

Absolutely, here it's: 
https://gist.github.com/jrziviani/a65e71c5d661bffa8afcd6710fedd520
It basically maps an IO region and also allocates some memory from the
program's address space. Then I store to/load from both addresses and
compare the results. Because only the mmio load/store are emulated, I
use the regular load/store as a model.

> 
> error in lvx at byte 0
> was: -> 62 69 70 77 7e 85 8c 93 2a 31 38 3f 46 4d 54 5b
> ref: -> 2a 31 38 3f 46 4d 54 5b 62 69 70 77 7e 85 8c 93
> error in stvx at byte 0
> was: -> 49 50 57 5e 65 6c 73 7a 11 18 1f 26 2d 34 3b 42
> ref: -> 11 18 1f 26 2d 34 3b 42 49 50 57 5e 65 6c 73 7a
> 
> The byte order within each 8-byte half is correct but the two halves
> are swapped.  ("was" is what was in memory and "ref" is the correct
> value.  For lvx it does lvx from emulated MMIO and stvx to ordinary
> memory, and for stvx it does lvx from ordinary memory and stvx to
> emulated MMIO.  In both cases the checking is done with a byte by byte
> comparison.)

The funny thing is that I still see it right in both cases, so I believe
that my test case is incorrect. Example (host LE, guest LE):

> VR0 after lvx
(gdb) p $vr0
{uint128 = 0x1234567887654321, v4_float = {
-1.72477726e-34, -3.03283305e-13, 1.46555735e+13, 5.69045661e-28},
  v4_int32 = {-2023406815, -1431651397, 1431651396, 305419896}, v8_int16 = {
17185, -30875, -17477, -21846, 17476, 21845, 22136, 4660}, v16_int8 = {33,
67, 101, -121, -69, -69, -86, -86, 68, 68, 85, 85, 120, 86, 52, 18}}


address: 0x10030010
0x1234567887654321

> VR0 after lvx from MMIO
(gdb) p $vr0
$3 = {uint128 = 0x1234567887654321, v4_float = {
-1.72477726e-34, -3.03283305e-13, 1.46555735e+13, 5.69045661e-28},
  v4_int32 = {-2023406815, -1431651397, 1431651396, 305419896}, v8_int16 = {
17185, -30875, -17477, -21846, 17476, 21845, 22136, 4660}, v16_int8 = {33,
67, 101, -121, -69, -69, -86, -86, 68, 68, 85, 85, 120, 86, 52, 18}}

io_address: 0x3fffb7f7
0x1234567887654321

I only see it wrong when I mess with copy order:

if (vcpu->arch.mmio_vmx_copy_nums == /*from 1 to */ 2) {
VCPU_VSX_VR(vcpu, index).u[kvmppc_get_vsr_word_offset(2)] = lo;
VCPU_VSX_VR(vcpu, index).u[kvmppc_get_vsr_word_offset(3)] = hi;
} else if (vcpu->arch.mmio_vmx_copy_nums == /*from 2 to */ 1) {
VCPU_VSX_VR(vcpu, index).u[kvmppc_get_vsr_word_offset(0)] = lo;
VCPU_VSX_VR(vcpu, index).u[kvmppc_get_vsr_word_offset(1)] = hi;
}

then I get:

address: 0x1003b530010
0x1234567887654321
io_address: 0x3fff811a
0x8765432112345678

Anyway, your suggestion works great and it's a way more elegant/easy to
understand. I'll send the next version with it.

Thank you very much for your patience and hints that helped me to
understand KVM better. :-)

> 
> Paul.
> 



Re: [PATCH RESEND 1/1] KVM: PPC: Book3S: Add MMIO emulation for VMX instructions

2018-01-10 Thread joserz
On Thu, Jan 11, 2018 at 09:36:58AM +1100, Paul Mackerras wrote:
> On Mon, Jan 08, 2018 at 04:29:31PM -0200, Jose Ricardo Ziviani wrote:
> > This patch provides the MMIO load/store vector indexed
> > X-Form emulation.
> > 
> > Instructions implemented: lvx, stvx
> > 
> > Signed-off-by: Jose Ricardo Ziviani 
> 
> What testing has been done of this patch?  In particular, what
> combinations of endianness of host and guest have been tested?  For a
> cross-endian situation (endianness of guest different from host) you
> would need to load the two halves of the VMX register image in
> reversed order (note that lvx/stvx are different from lxvd2x/stxvd2x
> and lxvw4x/stxvw4x in this respect), and I don't see that being done.

Hello! Thank you for reviewing it.

Yes, I tested it with both in little endian mode. You're absolutely
right, I'll review the cross-endian scenario that's not covered indeed.

> 
> Also, lvx/stvx mask off the bottom 4 bits of the EA, and I don't see
> that being done.
> 
> [snip]
> 
> > diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
> > index 1915e86cef6f..7d0f60359ee0 100644
> > --- a/arch/powerpc/kvm/powerpc.c
> > +++ b/arch/powerpc/kvm/powerpc.c
> > @@ -832,23 +832,7 @@ void kvm_arch_irq_bypass_del_producer(struct 
> > irq_bypass_consumer *cons,
> > kvm->arch.kvm_ops->irq_bypass_del_producer(cons, prod);
> >  }
> >  
> > -#ifdef CONFIG_VSX
> > -static inline int kvmppc_get_vsr_dword_offset(int index)
> > -{
> > -   int offset;
> > -
> > -   if ((index != 0) && (index != 1))
> > -   return -1;
> > -
> > -#ifdef __BIG_ENDIAN
> > -   offset =  index;
> > -#else
> > -   offset = 1 - index;
> > -#endif
> > -
> > -   return offset;
> > -}
> > -
> 
> Why does this function need to be moved down in the file?

The main reason is that I need it too but I need it available with
ALTIVEC(1) support, so I changed the guard to ALTIVEC and moved it down
closer to other related functions under the same guard. But I can move
it back if you prefer.

(1)AFAIK it's possible to have ALTIVEC enabled but VSX disabled, not the
other way around.

> 
> > +#ifdef CONFIG_ALTIVEC
> >  static inline int kvmppc_get_vsr_word_offset(int index)
> >  {
> > int offset;
> > @@ -864,6 +848,40 @@ static inline int kvmppc_get_vsr_word_offset(int index)
> > return offset;
> >  }
> >  
> > +static inline void kvmppc_set_vmx_dword(struct kvm_vcpu *vcpu,
> > +   u64 gpr)
> > +{
> > +   int index = vcpu->arch.io_gpr & KVM_MMIO_REG_MASK;
> > +   u64 hi = gpr >> 32;
> > +   u64 lo = gpr & 0x;
> > +
> > +   if (vcpu->arch.mmio_vmx_copy_nums == 1) {
> > +   VCPU_VSX_VR(vcpu, index).u[kvmppc_get_vsr_word_offset(2)] = hi;
> > +   VCPU_VSX_VR(vcpu, index).u[kvmppc_get_vsr_word_offset(3)] = lo;
> > +   } else if (vcpu->arch.mmio_vmx_copy_nums == 2) {
> > +   VCPU_VSX_VR(vcpu, index).u[kvmppc_get_vsr_word_offset(0)] = hi;
> > +   VCPU_VSX_VR(vcpu, index).u[kvmppc_get_vsr_word_offset(1)] = lo;
> > +   }
> 
> This splitting of the value into hi/lo words looks to me like you're
> assuming a big-endian byte ordering.  It looks like it will end up
> swapping the words in each half of the register on a little-endian
> host (regardless of the endianness of the guest).

Yep! I focused on VCPU_VSX_VR code to fill the vector correctly but
missed the 64-bit value splitting. I'll fix it and improve the testing,
considering other endian scenario.

> 
> > +}
> > +#endif /* CONFIG_ALTIVEC */
> > +
> > +#ifdef CONFIG_VSX
> > +static inline int kvmppc_get_vsr_dword_offset(int index)
> > +{
> > +   int offset;
> > +
> > +   if ((index != 0) && (index != 1))
> > +   return -1;
> > +
> > +#ifdef __BIG_ENDIAN
> > +   offset =  index;
> > +#else
> > +   offset = 1 - index;
> > +#endif
> > +
> > +   return offset;
> > +}
> > +
> >  static inline void kvmppc_set_vsr_dword(struct kvm_vcpu *vcpu,
> > u64 gpr)
> >  {
> > @@ -1027,6 +1045,11 @@ static void kvmppc_complete_mmio_load(struct 
> > kvm_vcpu *vcpu,
> > KVMPPC_VSX_COPY_DWORD_LOAD_DUMP)
> > kvmppc_set_vsr_dword_dump(vcpu, gpr);
> > break;
> > +#endif
> > +#ifdef CONFIG_ALTIVEC
> > +   case KVM_MMIO_REG_VMX:
> > +   kvmppc_set_vmx_dword(vcpu, gpr);
> > +   break;
> >  #endif
> > default:
> > BUG();
> > @@ -1307,6 +1330,99 @@ static int kvmppc_emulate_mmio_vsx_loadstore(struct 
> > kvm_vcpu *vcpu,
> >  }
> >  #endif /* CONFIG_VSX */
> >  
> > +#ifdef CONFIG_ALTIVEC
> > +/* handle quadword load access in two halves */
> > +int kvmppc_handle_load128_by2x64(struct kvm_run *run, struct kvm_vcpu 
> > *vcpu,
> > +   unsigned int rt, int is_default_endian)
> > +{
> > +   enum emulation_result emulated = EMULATE_DONE;
> > +
> > +   while (vcpu->arch.mmio_vmx_copy_nums) {
> > +   emulated = __kvmppc_handle_load(run, vcpu, rt, 8,
> > +   is_default_endian, 

Re: [PATCH] KVM: PPC: Book3S: Add capabilities for Meltdown/Spectre workarounds

2018-01-09 Thread joserz
On Tue, Jan 09, 2018 at 12:57:26PM +0100, Michal Suchánek wrote:
> On Tue, 09 Jan 2018 19:39:14 +1100
> Suraj Jitindar Singh  wrote:
> 
> > 
> > s/behavior/behaviour
> 
> Nope. Either is valid and the shorter American spelling is actually
> more common.
> 
> If you must nickpick choose something actually broken :p

Suraj is right:

struct h_cpu_char_result {
   u64 character;
   u64 behaviour;
};

> 
> Thanks
> 
> Michal
> 



Re: [PATCH v2 1/1] powerpc/pseries: increase pseries_dlpar_init initcall priority

2018-01-02 Thread joserz
On Tue, Jan 02, 2018 at 10:46:07PM +1100, Michael Ellerman wrote:
> Jose Ricardo Ziviani  writes:
> 
> > The hotplug engine uses its own workqueue to handle IRQ requests, the
> > problem is that such workqueue is initialized after init_ras_IRQ, which
> > will cause a kernel panic if any hotplug interruption is issued in that
> > period of time.
> >
> > This patch changes the dlpar initcall registration to make sure it will
> > be initialized before init_ras_IRQ.
> 
> Sorry I know this is already v2, but I don't think this is the best fix.
> 
> There's a dependency between the registration of the IRQ in the RAS
> code, and the creation of the work queue in the DLPAR code, but it's
> currently not explicit. That's the bug. So it'd be better to just make
> it explicit.
> 
> As a bonus we can add actual error checking of the workqueue allocation.
> 
> Something like below, can you test it please?

sure, no problem. I'll do it.

Thanks for reviewing it!!!

> 
> cheers
> 
> diff --git a/arch/powerpc/platforms/pseries/dlpar.c 
> b/arch/powerpc/platforms/pseries/dlpar.c
> index 6e35780c5962..dd8b29e58a98 100644
> --- a/arch/powerpc/platforms/pseries/dlpar.c
> +++ b/arch/powerpc/platforms/pseries/dlpar.c
> @@ -574,11 +574,26 @@ static ssize_t dlpar_show(struct class *class, struct 
> class_attribute *attr,
> 
>  static CLASS_ATTR_RW(dlpar);
> 
> -static int __init pseries_dlpar_init(void)
> +int __init dlpar_workqueue_init(void)
>  {
> + if (pseries_hp_wq)
> + return 0;
> +
>   pseries_hp_wq = alloc_workqueue("pseries hotplug workqueue",
>   WQ_UNBOUND, 1);
> +
> + return pseries_hp_wq ? 0 : -ENOMEM;
> +}
> +
> +static int __init dlpar_sysfs_init(void)
> +{
> + int rc;
> +
> + rc = dlpar_workqueue_init();
> + if (rc)
> + return rc;
> +
>   return sysfs_create_file(kernel_kobj, _attr_dlpar.attr);
>  }
> -machine_device_initcall(pseries, pseries_dlpar_init);
> +machine_device_initcall(pseries, dlpar_sysfs_init);
> 
> diff --git a/arch/powerpc/platforms/pseries/pseries.h 
> b/arch/powerpc/platforms/pseries/pseries.h
> index 4470a3194311..1ae1d9f4dbe9 100644
> --- a/arch/powerpc/platforms/pseries/pseries.h
> +++ b/arch/powerpc/platforms/pseries/pseries.h
> @@ -98,4 +98,6 @@ static inline unsigned long cmo_get_page_size(void)
>   return CMO_PageSize;
>  }
> 
> +int dlpar_workqueue_init(void);
> +
>  #endif /* _PSERIES_PSERIES_H */
> diff --git a/arch/powerpc/platforms/pseries/ras.c 
> b/arch/powerpc/platforms/pseries/ras.c
> index 4923ffe230cf..879a92327010 100644
> --- a/arch/powerpc/platforms/pseries/ras.c
> +++ b/arch/powerpc/platforms/pseries/ras.c
> @@ -69,8 +69,9 @@ static int __init init_ras_IRQ(void)
>   /* Hotplug Events */
>   np = of_find_node_by_path("/event-sources/hot-plug-events");
>   if (np != NULL) {
> - request_event_sources_irqs(np, ras_hotplug_interrupt,
> -"RAS_HOTPLUG");
> + if (dlpar_workqueue_init() == 0)
> + request_event_sources_irqs(np, ras_hotplug_interrupt,
> + "RAS_HOTPLUG");
>   of_node_put(np);
>   }
> 
> 



Re: [PATCH 1/1] powerpc/pseries: Use the system workqueue as fallback to hotplug workqueue

2017-12-22 Thread joserz
On Fri, Dec 22, 2017 at 11:54:10AM +1100, David Gibson wrote:
> On Thu, Dec 21, 2017 at 01:44:48PM -0200, Jose Ricardo Ziviani wrote:
> > The hotplug engine uses its own workqueue to handle IRQ requests, the
> > problem is that such workqueue is initialized not so early in the boot
> > process.
> > 
> > Thus, when the kernel is ready to handle IRQ requests, after the system
> > workqueue is initialized, we have a timeframe where any hotplug issued
> > by the client will result in a kernel panic. That timeframe goes until
> > the hotplug workqueue is initialized.
> > 
> > It would be good to have the hotplug workqueue initialized as soon as
> > the system workqueue but I don't think it is possible. So, this patch
> > uses the system workqueue as a fallback the handle such IRQs.
> > 
> > Signed-off-by: Jose Ricardo Ziviani 
> 
> I don't think this is the right approach.
> 
> It seems to me the bug is that the hotplug interrupt is registered in
> init_ras_IRQ(), before the work queue is initialized in
> pseries_dlpar_init().  We need to correct that ordering.
> 

Oh, this makes much more sense. I'm going to make some tests and send a
v2 then.

Thanks for reviewing it!

> > ---
> >  arch/powerpc/platforms/pseries/dlpar.c | 10 +-
> >  1 file changed, 9 insertions(+), 1 deletion(-)
> > 
> > diff --git a/arch/powerpc/platforms/pseries/dlpar.c 
> > b/arch/powerpc/platforms/pseries/dlpar.c
> > index 6e35780c5962..0474aa14b5f6 100644
> > --- a/arch/powerpc/platforms/pseries/dlpar.c
> > +++ b/arch/powerpc/platforms/pseries/dlpar.c
> > @@ -399,7 +399,15 @@ void queue_hotplug_event(struct pseries_hp_errorlog 
> > *hp_errlog,
> > work->errlog = hp_errlog_copy;
> > work->hp_completion = hotplug_done;
> > work->rc = rc;
> > -   queue_work(pseries_hp_wq, (struct work_struct *)work);
> > +
> > +   /* The hotplug workqueue may happen to be NULL at the moment
> > +* this code is executed, during the boot phase. So, in this
> > +* scenario, we can fallback to the system workqueue.
> > +*/
> > +   if (unlikely(pseries_hp_wq == NULL))
> > +   schedule_work((struct work_struct *)work);
> > +   else
> > +   queue_work(pseries_hp_wq, (struct work_struct *)work);
> > } else {
> > *rc = -ENOMEM;
> > kfree(hp_errlog_copy);
> 
> -- 
> David Gibson  | I'll have my music baroque, and my code
> david AT gibson.dropbear.id.au| minimalist, thank you.  NOT _the_ 
> _other_
>   | _way_ _around_!
> http://www.ozlabs.org/~dgibson




Re: [PATCH] powerpc/eeh: Disable EEH stack dump by default

2017-09-20 Thread joserz
On Wed, Sep 20, 2017 at 02:47:08PM +1000, Michael Ellerman wrote:
> Jose Ricardo Ziviani  writes:
> 
> > Today, each EEH causes a stack dump to be printed in the logs. In
> > production environment it's not quite necessary. Thus, this patch
> 
> I'm unconvinced. A production environment is exactly where you don't
> want to be getting an EEH, and so if you *do* then every bit of
> information is helpful.
> 
> > For example, instead of the following:
> >
> > [  131.778661] EEH: Frozen PHB#2-PE#fd detected
> > [  131.778672] EEH: PE location: N/A, PHB location: N/A
> > [  131.778677] CPU: 21 PID: 10098 Comm: lspci Not tainted ...
> > [  131.778680] Call Trace:
> > [  131.778686] [c003a140bab0] [c0beb58c] dump_stack+...
> > 
> > [  131.778770] EEH: Detected PCI bus error on PHB#2-PE#fd
> > [  131.778775] EEH: This PCI device has failed 1 times in the last hour
> > ...
> >
> > we will have this by default:
> >
> > [12777.175880] EEH: Frozen PHB#2-PE#fd detected
> > [12777.175893] EEH: PE location: N/A, PHB location: N/A
> > [12777.175922] EEH: Detected PCI bus error on PHB#2-PE#fd
> > [12777.175931] EEH: This PCI device has failed 2 times in the last hour
> 
> *What* PCI device?
> 
> How am I supposed to know what device/driver just failed? If I had the
> stack trace I could probably at least work it out based on the driver
> involved.
> 
> cheers
> 

Thank you guys! More people told me it's important to keep it as is.
Please, disregard this patch.



Re: [PATCH RFC] powerpc: Implements MMIO emulation for lvx/stvx instructions

2017-08-30 Thread joserz
On Wed, Aug 30, 2017 at 07:45:17PM +1000, Paul Mackerras wrote:
> On Tue, Aug 29, 2017 at 07:18:01PM -0300, Jose Ricardo Ziviani wrote:
> > Hello!
> > 
> > This patch implements MMIO emulation for two instructions: lvx and stvx. I 
> > started to implement other instructions but I'd like to have this reviewed 
> > beforehand because this is my first patch here and I'll certainly have some 
> > rework/fixes :-).
> > 
> > Note: stvx is only storing 8 bytes, for some reason the code 
> > "vcpu->arch.paddr_accessed += run->mmio.len;", which adds the 8-byte offset 
> > after the first write is not making any difference (interesting that it 
> > works for load operations). I'm still investigating it but any idea about 
> > it will be appreciated.
> 
> The run structure is mmapped by userspace (i.e. QEMU) and can be
> written by userspace between the first and the second exits to
> userspace (you have to do two exits to userspace because you can only
> transfer 8 bytes on each exit).  It's possible that userspace might be
> clearing run->mmio.len.  In general it's better not to rely on
> anything in *run (except of course the mmio_data for a MMIO read) when
> we come in from userspace to the kernel.
> 
> Paul.
> 

Hello Paul,

My bad, actually it works. I was mmap'ping an address that doesn't allow 
16-byte writing access. After mmap'ping a higher address (of the same device) I 
was able to perform 16-byte read/write.

== before stvx ==

  (gdb) info registers vr0
  vr0  {uint128 = 0x12345678abcdef09, ...}

  (gdb) info registers r9
  r9 0x3fffb7c90010

  (gdb) x /4wx 0x3fffb7c90010
  0x3fffb7c90010: 0x 0x 0x 0x

  (gdb) info registers r28
  r28 0x0 

stvxv0,r28,r9

== after stvx ==

  (gdb) x /4wx 0x3fffb7c90010
  0x3fffb7c90010: 0x12345678 0x 0xabcdef09 0x

== before lvx ==

  (gdb) info registers vr10
  vr10 {uint128 = 0x,...}

lvx v10,r28,r9

== after lvx ==

  (gdb) info registers vr10
  vr10 {uint128 = 0x12345678abcdef09,...}

If you think it's ok I'll submit this patch without the RFC.

Thank you very much!

Ziviani



Re: [PATCH v2] powerpc/mm: Check for _PAGE_PTE in *_devmap()

2017-07-27 Thread joserz
On Fri, Jul 28, 2017 at 01:35:53AM +1000, Oliver O'Halloran wrote:
> The ISA radix translation tree contains two different types of entry,
> directories and leaves. The formats of the two entries are different
> with the directory entries containing no spare bits for use by software.
> As a result we need to ensure that the *_devmap() family of functions
> check fail for everything except leaf (PTE) entries.
> 
> Signed-off-by: Oliver O'Halloran 
> ---
> "i'll just tweak the mbox before i sent it, what's the worst that can happen"
> *completely breaks KVM*
> "..."
> ---
>  arch/powerpc/include/asm/book3s/64/pgtable.h | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h 
> b/arch/powerpc/include/asm/book3s/64/pgtable.h
> index d1da415..6bc6248 100644
> --- a/arch/powerpc/include/asm/book3s/64/pgtable.h
> +++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
> @@ -610,7 +610,9 @@ static inline pte_t pte_mkdevmap(pte_t pte)
> 
>  static inline int pte_devmap(pte_t pte)
>  {
> - return !!(pte_raw(pte) & cpu_to_be64(_PAGE_DEVMAP));
> + uint64_t mask = cpu_to_be64(_PAGE_DEVMAP | _PAGE_PTE);
> +
> + return (pte_raw(pte) & mask) == mask;
>  }
> 
>  static inline pte_t pte_modify(pte_t pte, pgprot_t newprot)
> -- 
> 2.7.4
> 

Michael should thank you because I was about to send another insane
suggestion and ask him if that makes sense. :-D

Tested-by: Jose Ricardo Ziviani 



Re: KVM guests freeze under upstream kernel

2017-07-26 Thread joserz
On Thu, Jul 20, 2017 at 10:18:18PM -0300, jos...@linux.vnet.ibm.com wrote:
> On Thu, Jul 20, 2017 at 03:21:59PM +1000, Paul Mackerras wrote:
> > On Thu, Jul 20, 2017 at 12:02:23AM -0300, jos...@linux.vnet.ibm.com wrote:
> > > On Thu, Jul 20, 2017 at 09:42:50AM +1000, Benjamin Herrenschmidt wrote:
> > > > On Wed, 2017-07-19 at 16:46 -0300, jos...@linux.vnet.ibm.com wrote:
> > > > > Hello!
> > > > > 
> > > > > We're not able to boot any KVM guest using upstream kernel 
> > > > > (cb8c65ccff7f77d0285f1b126c72d37b2572c865 - 4.13.0-rc1+).
> > > > > After reaching the SLOF initial counting, the guest simply freezes:
> > > > 
> > > > Can you send our .config ?
> > > 
> > > Sure,
> > > 
> > > Answering Michael as well:
> > > 
> > > It's a P9 with RHEL kernel 4.11.0-10.el7a.ppc64le installed. The problem
> > > was noticed with kernel > 4.13 (I'm currently running 4.13.0-rc1+).
> > > 
> > > QEMU is https://github.com/dgibson/qemu (ppc-for-2.10) but I gave the
> > > default packaged Qemu a try.
> > > 
> > > For the guest, I tried both a vanilla Ubuntu 17.04 and the host kernel.
> > > But they had never a chance to run since the freezing happened in SLOF.
> > > 
> > > Note that using the 4.11.0-10.el7a.ppc64le kernel it works fine
> > > (for any of these Qemu/Guest setup). With 4.13.0-rc1 I have it run after
> > > reverting that referred commit.
> > 
> > Is the host kernel running in radix mode?
> 
> yes
> 
> > 
> > Did you check the host kernel logs for any oops messages?
> 
> dmesg was clean but after sometime waiting (I forgot QEMU running in
> another terminal) I got the oops below (after rebooting the host I 
> couldn't reproduce it again).
> 
> Another test that I did was:
> Compile with transparent huge pages disabled: KVM works fine
> Compile with transparent huge pages enabled: doesn't work
>   + disabling it in /sys/kernel/mm/transparent_hugepage: doesn't work
> 
> Just out of my own curiosity I made this small change:
> 
> diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h
> b/arch/powerpc/include
> index c0737c8..f94a3b6 100644
> --- a/arch/powerpc/include/asm/book3s/64/pgtable.h
> +++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
> @@ -80,7 +80,7 @@
>  
>   #define _PAGE_SOFT_DIRTY   _RPAGE_SW3 /* software: software dirty
>   tracking 
>#define _PAGE_SPECIAL  _RPAGE_SW2 /* software: special page */
>-#define _PAGE_DEVMAP   _RPAGE_SW1 /* software: ZONE_DEVICE page */
>+#define _PAGE_DEVMAP   _RPAGE_RSV3
> #define __HAVE_ARCH_PTE_DEVMAP
> 
> and it works. I chose _RPAGE_RSV3 because it uses the same value that
> x86 uses (0x0400UL) but I don't if it could have any side
> effect
> 

Does this change make any sense to you people?
I didn't see any side effect expect that devices backed memory will have
a bigger address space in transparent huge pages IF I understand that
correctly.

If so I can send a patch with this change.

Thank you!!



Re: KVM guests freeze under upstream kernel

2017-07-20 Thread joserz
On Thu, Jul 20, 2017 at 03:21:59PM +1000, Paul Mackerras wrote:
> On Thu, Jul 20, 2017 at 12:02:23AM -0300, jos...@linux.vnet.ibm.com wrote:
> > On Thu, Jul 20, 2017 at 09:42:50AM +1000, Benjamin Herrenschmidt wrote:
> > > On Wed, 2017-07-19 at 16:46 -0300, jos...@linux.vnet.ibm.com wrote:
> > > > Hello!
> > > > 
> > > > We're not able to boot any KVM guest using upstream kernel 
> > > > (cb8c65ccff7f77d0285f1b126c72d37b2572c865 - 4.13.0-rc1+).
> > > > After reaching the SLOF initial counting, the guest simply freezes:
> > > 
> > > Can you send our .config ?
> > 
> > Sure,
> > 
> > Answering Michael as well:
> > 
> > It's a P9 with RHEL kernel 4.11.0-10.el7a.ppc64le installed. The problem
> > was noticed with kernel > 4.13 (I'm currently running 4.13.0-rc1+).
> > 
> > QEMU is https://github.com/dgibson/qemu (ppc-for-2.10) but I gave the
> > default packaged Qemu a try.
> > 
> > For the guest, I tried both a vanilla Ubuntu 17.04 and the host kernel.
> > But they had never a chance to run since the freezing happened in SLOF.
> > 
> > Note that using the 4.11.0-10.el7a.ppc64le kernel it works fine
> > (for any of these Qemu/Guest setup). With 4.13.0-rc1 I have it run after
> > reverting that referred commit.
> 
> Is the host kernel running in radix mode?

yes

> 
> Did you check the host kernel logs for any oops messages?

dmesg was clean but after sometime waiting (I forgot QEMU running in
another terminal) I got the oops below (after rebooting the host I 
couldn't reproduce it again).

Another test that I did was:
Compile with transparent huge pages disabled: KVM works fine
Compile with transparent huge pages enabled: doesn't work
  + disabling it in /sys/kernel/mm/transparent_hugepage: doesn't work

Just out of my own curiosity I made this small change:

diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h
b/arch/powerpc/include
index c0737c8..f94a3b6 100644
--- a/arch/powerpc/include/asm/book3s/64/pgtable.h
+++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
@@ -80,7 +80,7 @@
 
  #define _PAGE_SOFT_DIRTY   _RPAGE_SW3 /* software: software dirty
  tracking 
   #define _PAGE_SPECIAL  _RPAGE_SW2 /* software: special page */
   -#define _PAGE_DEVMAP   _RPAGE_SW1 /* software: ZONE_DEVICE page */
   +#define _PAGE_DEVMAP   _RPAGE_RSV3
#define __HAVE_ARCH_PTE_DEVMAP

and it works. I chose _RPAGE_RSV3 because it uses the same value that
x86 uses (0x0400UL) but I don't if it could have any side
effect


SLOF
**
QEMU Starting
 Build Date = Mar  3 2017 13:29:19
  FW Version = git-66d250ef0fd06bb8
   Press "s" to enter Open Firmware.

   [  105.604333] Unable to handle kernel paging request for data at
   address 0x
   [  105.604448] Faulting instruction address: 0xc0910b28
   [  105.604526] Oops: Kernel access of bad area, sig: 11 [#1]
   [  105.604585] SMP NR_CPUS=2048 
   [  105.604588] NUMA 
   [  105.604633] PowerNV
   [  105.604697] Modules linked in: xt_CHECKSUM ipt_MASQUERADE
   nf_nat_masquerade_ipv4 tun ip6t_rpfilter ipt_REJECT nf_reject_ipv4
   ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat
   ebtable_broute bridge stp llc ip6table_nat nf_conntrack_ipv6
   nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security
   ip6table_raw iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4
   nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw
   ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter
   kvm_hv kvm i2c_dev at24 ghash_generic ses enclosure gf128mul
   scsi_transport_sas xts sg ctr ipmi_powernv ipmi_devintf shpchp
   opal_prd vmx_crypto ipmi_msghandler uio_pdrv_genirq uio ofpart
   powernv_flash i2c_opal ibmpowernv mtd nfsd auth_rpcgss nfs_acl lockd
   grace sunrpc ip_tables xfs libcrc32c
   [  105.605561]  sd_mod ast i2c_algo_bit drm_kms_helper syscopyarea
   sysfillrect sysimgblt fb_sys_fops ttm drm i40e i2c_core aacraid ptp
   pps_core dm_mirror dm_region_hash dm_log dm_mod
   [  105.605759] CPU: 0 PID: 6 Comm: kworker/u32:0 Not tainted
   4.13.0-rc1+ #57
   [  105.605836] Workqueue: netns cleanup_net
   [  105.605880] task: c00ff6404200 task.stack: c00ff648c000
   [  105.605947] NIP: c0910b28 LR: c07cd6ec CTR:
   c07cd5d0
   [  105.606026] REGS: c00ff648f7d0 TRAP: 0300   Not tainted
   (4.13.0-rc1+)
   [  105.606090] MSR: 90009033 
   [  105.606111]   CR: 88002048  XER: 2000
   [  105.606203] CFAR: c07cd6e8 DAR:  DSISR:
   4000 SOFTE: 1 
   [  105.606203] GPR00: c07cd6ec c00ff648fa50
   c0f5c600  
   [  105.606203] GPR04: c00ff6404cc0 c00ff6404280
   782ccd5c cc908fe7 
   [  105.606203] GPR08:  c00ff648c000
   8000  
   [  105.606203] GPR12: c07cd5d0 cfb0
   c01050f8 c00ffa150ec0 

Re: KVM guests freeze under upstream kernel

2017-07-19 Thread joserz
On Thu, Jul 20, 2017 at 09:42:50AM +1000, Benjamin Herrenschmidt wrote:
> On Wed, 2017-07-19 at 16:46 -0300, jos...@linux.vnet.ibm.com wrote:
> > Hello!
> > 
> > We're not able to boot any KVM guest using upstream kernel 
> > (cb8c65ccff7f77d0285f1b126c72d37b2572c865 - 4.13.0-rc1+).
> > After reaching the SLOF initial counting, the guest simply freezes:
> 
> Can you send our .config ?

Sure,

Answering Michael as well:

It's a P9 with RHEL kernel 4.11.0-10.el7a.ppc64le installed. The problem
was noticed with kernel > 4.13 (I'm currently running 4.13.0-rc1+).

QEMU is https://github.com/dgibson/qemu (ppc-for-2.10) but I gave the
default packaged Qemu a try.

For the guest, I tried both a vanilla Ubuntu 17.04 and the host kernel.
But they had never a chance to run since the freezing happened in SLOF.

Note that using the 4.11.0-10.el7a.ppc64le kernel it works fine
(for any of these Qemu/Guest setup). With 4.13.0-rc1 I have it run after
reverting that referred commit.

Thanks!

(config attached)
--
#
# Automatically generated file; DO NOT EDIT.
# Linux/powerpc 4.13.0-rc1 Kernel Configuration
#
CONFIG_PPC64=y

#
# Processor support
#
CONFIG_PPC_BOOK3S_64=y
# CONFIG_PPC_BOOK3E_64 is not set
# CONFIG_POWER7_CPU is not set
CONFIG_POWER8_CPU=y
CONFIG_PPC_BOOK3S=y
CONFIG_PPC_FPU=y
CONFIG_ALTIVEC=y
CONFIG_VSX=y
CONFIG_PPC_ICSWX=y
# CONFIG_PPC_ICSWX_PID is not set
# CONFIG_PPC_ICSWX_USE_SIGILL is not set
CONFIG_PPC_STD_MMU=y
CONFIG_PPC_STD_MMU_64=y
CONFIG_PPC_RADIX_MMU=y
CONFIG_ARCH_ENABLE_HUGEPAGE_MIGRATION=y
CONFIG_PPC_MM_SLICES=y
CONFIG_PPC_HAVE_PMU_SUPPORT=y
CONFIG_PPC_PERF_CTRS=y
CONFIG_FORCE_SMP=y
CONFIG_SMP=y
CONFIG_NR_CPUS=2048
CONFIG_PPC_DOORBELL=y
# CONFIG_CPU_BIG_ENDIAN is not set
CONFIG_CPU_LITTLE_ENDIAN=y
CONFIG_PPC64_BOOT_WRAPPER=y
CONFIG_64BIT=y
CONFIG_ARCH_PHYS_ADDR_T_64BIT=y
CONFIG_ARCH_DMA_ADDR_T_64BIT=y
CONFIG_MMU=y
CONFIG_ARCH_MMAP_RND_BITS_MAX=29
CONFIG_ARCH_MMAP_RND_BITS_MIN=14
CONFIG_ARCH_MMAP_RND_COMPAT_BITS_MAX=13
CONFIG_ARCH_MMAP_RND_COMPAT_BITS_MIN=7
CONFIG_HAVE_SETUP_PER_CPU_AREA=y
CONFIG_NEED_PER_CPU_EMBED_FIRST_CHUNK=y
CONFIG_NR_IRQS=512
CONFIG_NMI_IPI=y
CONFIG_STACKTRACE_SUPPORT=y
CONFIG_TRACE_IRQFLAGS_SUPPORT=y
CONFIG_LOCKDEP_SUPPORT=y
CONFIG_RWSEM_XCHGADD_ALGORITHM=y
CONFIG_GENERIC_HWEIGHT=y
CONFIG_ARCH_HAS_DMA_SET_COHERENT_MASK=y
CONFIG_PPC=y
# CONFIG_GENERIC_CSUM is not set
CONFIG_EARLY_PRINTK=y
CONFIG_PANIC_TIMEOUT=180
CONFIG_COMPAT=y
CONFIG_SYSVIPC_COMPAT=y
CONFIG_SCHED_OMIT_FRAME_POINTER=y
CONFIG_ARCH_MAY_HAVE_PC_FDC=y
CONFIG_PPC_UDBG_16550=y
# CONFIG_GENERIC_TBSYNC is not set
CONFIG_AUDIT_ARCH=y
CONFIG_GENERIC_BUG=y
CONFIG_EPAPR_BOOT=y
# CONFIG_DEFAULT_UIMAGE is not set
CONFIG_ARCH_HIBERNATION_POSSIBLE=y
CONFIG_ARCH_SUSPEND_POSSIBLE=y
# CONFIG_PPC_DCR_NATIVE is not set
# CONFIG_PPC_DCR_MMIO is not set
# CONFIG_PPC_OF_PLATFORM_PCI is not set
CONFIG_ARCH_SUPPORTS_DEBUG_PAGEALLOC=y
CONFIG_ARCH_SUPPORTS_UPROBES=y
CONFIG_PPC_EMULATE_SSTEP=y
CONFIG_ZONE_DMA32=y
CONFIG_PGTABLE_LEVELS=4
CONFIG_DEFCONFIG_LIST="/lib/modules/$UNAME_RELEASE/.config"
CONFIG_IRQ_WORK=y
CONFIG_BUILDTIME_EXTABLE_SORT=y

#
# General setup
#
CONFIG_INIT_ENV_ARG_LIMIT=32
CONFIG_CROSS_COMPILE=""
# CONFIG_COMPILE_TEST is not set
CONFIG_LOCALVERSION=""
# CONFIG_LOCALVERSION_AUTO is not set
CONFIG_HAVE_KERNEL_GZIP=y
CONFIG_HAVE_KERNEL_XZ=y
CONFIG_KERNEL_GZIP=y
# CONFIG_KERNEL_XZ is not set
CONFIG_DEFAULT_HOSTNAME="(none)"
CONFIG_SWAP=y
CONFIG_SYSVIPC=y
CONFIG_SYSVIPC_SYSCTL=y
CONFIG_POSIX_MQUEUE=y
CONFIG_POSIX_MQUEUE_SYSCTL=y
CONFIG_CROSS_MEMORY_ATTACH=y
CONFIG_FHANDLE=y
# CONFIG_USELIB is not set
CONFIG_AUDIT=y
CONFIG_HAVE_ARCH_AUDITSYSCALL=y
CONFIG_AUDITSYSCALL=y
CONFIG_AUDIT_WATCH=y
CONFIG_AUDIT_TREE=y

#
# IRQ subsystem
#
CONFIG_GENERIC_IRQ_SHOW=y
CONFIG_GENERIC_IRQ_SHOW_LEVEL=y
CONFIG_GENERIC_IRQ_MIGRATION=y
CONFIG_HARDIRQS_SW_RESEND=y
CONFIG_IRQ_DOMAIN=y
CONFIG_GENERIC_MSI_IRQ=y
# CONFIG_IRQ_DOMAIN_DEBUG is not set
CONFIG_IRQ_FORCED_THREADING=y
CONFIG_SPARSE_IRQ=y
# CONFIG_GENERIC_IRQ_DEBUGFS is not set
CONFIG_GENERIC_TIME_VSYSCALL=y
CONFIG_GENERIC_CLOCKEVENTS=y
CONFIG_ARCH_HAS_TICK_BROADCAST=y
CONFIG_GENERIC_CLOCKEVENTS_BROADCAST=y
CONFIG_GENERIC_CMOS_UPDATE=y

#
# Timers subsystem
#
CONFIG_TICK_ONESHOT=y
CONFIG_NO_HZ_COMMON=y
# CONFIG_HZ_PERIODIC is not set
# CONFIG_NO_HZ_IDLE is not set
CONFIG_NO_HZ_FULL=y
# CONFIG_NO_HZ_FULL_ALL is not set
CONFIG_NO_HZ=y
CONFIG_HIGH_RES_TIMERS=y

#
# CPU/Task time and stats accounting
#
CONFIG_VIRT_CPU_ACCOUNTING=y
CONFIG_VIRT_CPU_ACCOUNTING_GEN=y
# CONFIG_IRQ_TIME_ACCOUNTING is not set
CONFIG_BSD_PROCESS_ACCT=y
CONFIG_BSD_PROCESS_ACCT_V3=y
CONFIG_TASKSTATS=y
CONFIG_TASK_DELAY_ACCT=y
CONFIG_TASK_XACCT=y
CONFIG_TASK_IO_ACCOUNTING=y

#
# RCU Subsystem
#
CONFIG_TREE_RCU=y
# CONFIG_RCU_EXPERT is not set
CONFIG_SRCU=y
CONFIG_TREE_SRCU=y
# CONFIG_TASKS_RCU is not set
CONFIG_RCU_STALL_COMMON=y
CONFIG_RCU_NEED_SEGCBLIST=y
CONFIG_CONTEXT_TRACKING=y
# CONFIG_CONTEXT_TRACKING_FORCE is not set
CONFIG_RCU_NOCB_CPU=y
# CONFIG_BUILD_BIN2C is not set
# 

KVM guests freeze under upstream kernel

2017-07-19 Thread joserz
Hello!

We're not able to boot any KVM guest using upstream kernel 
(cb8c65ccff7f77d0285f1b126c72d37b2572c865 - 4.13.0-rc1+).
After reaching the SLOF initial counting, the guest simply freezes:

SLOF
**
QEMU Starting
 Build Date = Mar  3 2017 13:29:19
  FW Version = git-66d250ef0fd06bb8
   Press "s" to enter Open Firmware.

   C0360

After bisecting I found the commit:

https://github.com/torvalds/linux/commit/ebd3119

powerpc/mm: Add devmap support for ppc64

Add support for the devmap bit on PTEs and PMDs for PPC64 Book3S.  This
is used to differentiate device backed memory from transparent huge
pages since they are handled in more or less the same manner by the core
mm code.

Reverting the commit and rebuilding 4.13.0-rc1+ was enough to make a workaround.
But I'll need some help from you guys in order to solve it.

Thanks!

Jose Ziviani