Re: [PULL 19/35] ppc/pnv: Add models for POWER9 PHB4 PCIe Host bridge

2022-03-31 Thread Benjamin Herrenschmidt
On Thu, 2022-03-31 at 18:51 +0100, Peter Maydell wrote:
> 
> Hi; Coverity has just spotted an error in this old change
> (CID 1487176):

Oh my this is old ... I don't work for IBM anymore but I found the
relevant doc here: 
https://wiki.raptorcs.com/w/images/a/a5/POWER9_PCIe_controller_v11_27JUL2018_pub.pdf

So

> > +++ b/hw/pci-host/pnv_phb4_pec.c
> > +static void pnv_pec_pci_xscom_write(void *opaque, hwaddr addr,
> > +uint64_t val, unsigned size)
> > +{
> > +PnvPhb4PecState *pec = PNV_PHB4_PEC(opaque);
> > +uint32_t reg = addr >> 3;
> > +
> > +switch (reg) {
> > +case PEC_PCI_PBAIB_HW_CONFIG:
> > +case PEC_PCI_PBAIB_READ_STK_OVR:
> > +pec->pci_regs[reg] = val;
> 
> This write function switches on 'reg' and is written assuming
> that these PEC_PCI* constants are valid array indexes...

They should be but...

> > +break;
> > +default:
> > +phb_pec_error(pec, "%s @0x%"HWADDR_PRIx"=%"PRIx64"\n",
> > __func__,
> > +  addr, val);
> > +}
> > +}
> > +++ b/include/hw/pci-host/pnv_phb4.h
> > +struct PnvPhb4PecStatimages/images/e {
> > +DeviceState parent;
> > +
> > +/* PEC number in chip */
> > +uint32_t index;
> > +uint32_t chip_id;images/
> > +
> > +MemoryRegion *system_memory;
> > +
> > +/* Nest registers, excuding per-stack */
> > +#define PHB4_PEC_NEST_REGS_COUNT0xf
> > +uint64_t nest_regs[PHB4_PEC_NEST_REGS_COUNT];
> > +MemoryRegion nest_regs_mr;
> > +
> > +/* PCI registers, excluding per-stack */
> > +#define PHB4_PEC_PCI_REGS_COUNT 0x2
> > +uint64_t pci_regs[PHB4_PEC_PCI_REGS_COUNT];
> > +MemoryRegion pci_regs_mr;
> 
> ...but we define the pci_regs[] array in PnvPhb4PecState to
> have only 2 elements...
> 
> > +++ b/include/hw/pci-host/pnv_phb4_regs.h
> > +/* XSCOM PCI global registers */
> > +#define PEC_PCI_PBAIB_HW_CONFIG 0x00
> > +#define PEC_PCI_PBAIB_READ_STK_OVR  0x02
> 
> ...and here we define PEC_PCI_PBAIB_READ_STK_OVR as 2, which makes
> it not a valid index into pci_regs[].
> 
> 
> Which of these is wrong?

This one:

#define PHB4_PEC_PCI_REGS_COUNT 0x2

Should be

#define PHB4_PEC_PCI_REGS_COUNT 0x3

There is no register at 0x1 though.

Cheers,
Ben.




Re: [PATCH 1/2] hw/misc/bcm2835_property: Fix framebuffer with recent RPi kernels

2021-10-18 Thread Benjamin Herrenschmidt
On Mon, 2021-10-18 at 11:47 +0200, Philippe Mathieu-Daudé wrote:
> 
> > I've just checked the rpi-5.15.y branch and it's the same.
> 
> Indeed. I stopped testing recent kernels because they use too many
> features QEMU don't implement.
> 
> Our model should generate the DTB blob of devices implemented, instead
> of giving false expectations to the guest by passing an unmodified dtb.
> 
> This is on my TODO, I might give it a try next WE.

Indeed. That said, we do implement the fb, so we probably want that
fix. The fix for the virtual gpios is probably unnecessary however if
we do what you want.

That being said, with those two fixes, I can boot the latest 5.10 I get
from raspbian.

Cheers,
Ben.





Re: [PATCH 1/2] hw/misc/bcm2835_property: Fix framebuffer with recent RPi kernels

2021-10-17 Thread Benjamin Herrenschmidt
On Sun, 2021-10-17 at 17:08 +0200, Philippe Mathieu-Daudé wrote:
> Hi Benjamin,
> 
> On 10/17/21 09:48, Benjamin Herrenschmidt wrote:
> > The framebuffer driver fails to initialize with recent Raspberry Pi
> > kernels, such as the ones shipped in the current RaspiOS images
> > (with the out of tree bcm2708_fb.c driver)
> 
> Which particular version?

The one I dug out of 2021-05-07-raspios-buster-arm64-lite.img (note
that this then fails to boot some time after the fb is setup, even
after the fix, in the vchip driver init (before serial is up even).

That said, the same fb problem happens with 5.10.60-v8+ from raspbian.

I'm not sure your fix will work on these, see below:

> +case 0x00040013: /* Get number of displays */
> +stl_le_phys(>dma_as, value + 12, 0 /* old fw: unique display 
> */);
> +resplen = 4;
> +break;
> 
This should have been equivalen to not having the property. However,
the failure path in the driver looks like this (note the mismatch
between the comment and the code.. this is rpi 5.10.73 from the rpi
repo :

ret = rpi_firmware_property(fw,
RPI_FIRMWARE_FRAMEBUFFER_GET_NUM_DISPLAYS,
_displays, sizeof(u32));

/* If we fail to get the number of displays, or it returns 0, then
 * assume old firmware that doesn't have the mailbox call, so just
 * set one display
 */
if (ret || num_displays == 0) {
dev_err(>dev,
"Unable to determine number of FBs. Disabling 
driver.\n");
return -ENOENT;
} else {
fbdev->firmware_supports_multifb = 1;
}

So it appears that the intention at some stage was to set only one display but
the code as written will fail to initialize the drive if the properly is absent
*or* if it returns 0.

I've just checked the rpi-5.15.y branch and it's the same.

Cheers,
Ben.





[PATCH 2/2] hw/misc/bcm2835_property: Add dummy Get/Set GPIO virt buf messages

2021-10-17 Thread Benjamin Herrenschmidt
Without these the RaspiOS kernel tries to ioremap some bogus address
and dumps a backtrace in the console at boot. These work around it.

The virt-gpio driver still fails to initialize but much more cleanly

Signed-off-by: Benjamin Herrenschmidt 
---
 hw/misc/bcm2835_property.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/hw/misc/bcm2835_property.c b/hw/misc/bcm2835_property.c
index b958fa6a5c..62037c0630 100644
--- a/hw/misc/bcm2835_property.c
+++ b/hw/misc/bcm2835_property.c
@@ -274,6 +274,13 @@ static void 
bcm2835_property_mbox_push(BCM2835PropertyState *s, uint32_t value)
 resplen = 4;
 break;
 
+case 0x00048020: /* Set GPIO virt buf */
+/* fall through */
+case 0x00040010: /* Get GPIO virt buf */
+stl_le_phys(>dma_as, value + 12, 0);
+resplen = 4;
+break;
+
 case 0x00060001: /* Get DMA channels */
 /* channels 2-5 */
 stl_le_phys(>dma_as, value + 12, 0x003C);





[PATCH 1/2] hw/misc/bcm2835_property: Fix framebuffer with recent RPi kernels

2021-10-17 Thread Benjamin Herrenschmidt
The framebuffer driver fails to initialize with recent Raspberry Pi
kernels, such as the ones shipped in the current RaspiOS images
(with the out of tree bcm2708_fb.c driver)

The reason is that this driver uses a new firmware call to query the
number of displays, and the fallback when this call fails is broken.

So implement the call and claim we have exactly one display

Signed-off-by: Benjamin Herrenschmidt 
---
 hw/misc/bcm2835_property.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/hw/misc/bcm2835_property.c b/hw/misc/bcm2835_property.c
index 73941bdae9..b958fa6a5c 100644
--- a/hw/misc/bcm2835_property.c
+++ b/hw/misc/bcm2835_property.c
@@ -269,6 +269,10 @@ static void 
bcm2835_property_mbox_push(BCM2835PropertyState *s, uint32_t value)
 stl_le_phys(>dma_as, value + 12, 0);
 resplen = 4;
 break;
+case 0x00040013: /* Get num displays */
+stl_le_phys(>dma_as, value + 12, 1);
+resplen = 4;
+break;
 
 case 0x00060001: /* Get DMA channels */
 /* channels 2-5 */





Re: [PATCH for-6.1? v2 7/9] hw/pci-hist/pnv_phb4: Fix typo in pnv_phb4_ioda_write

2021-07-25 Thread Benjamin Herrenschmidt
On Sun, 2021-07-25 at 23:27 +0200, Philippe Mathieu-Daudé wrote:
> +Cédric/Benjamin
> 
> On 7/25/21 2:24 PM, Richard Henderson wrote:
> > From clang-13:
> > hw/pci-host/pnv_phb4.c:375:18: error: variable 'v' set but not used
> > \
> > [-Werror,-Wunused-but-set-variable]
> > 
> > It's pretty clear that we meant to write back 'v' after
> > all that computation and not 'val'.
> > 
> 
> Fixes: 4f9924c4d4c ("ppc/pnv: Add models for POWER9 PHB4 PCIe Host
> bridge")

Acked-by: Benjamin Herrenschmidt 

> 
> > Acked-by: David Gibson 
> > Signed-off-by: Richard Henderson 
> > ---
> >  hw/pci-host/pnv_phb4.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/hw/pci-host/pnv_phb4.c b/hw/pci-host/pnv_phb4.c
> > index 54f57c660a..5c375a9f28 100644
> > --- a/hw/pci-host/pnv_phb4.c
> > +++ b/hw/pci-host/pnv_phb4.c
> > @@ -392,7 +392,7 @@ static void pnv_phb4_ioda_write(PnvPHB4 *phb,
> > uint64_t val)
> >  v &= 0xull;
> >  v |= 0xcfffull & val;
> >  }
> > -*tptr = val;
> > +*tptr = v;
> >  break;
> >  }
> >  case IODA3_TBL_MBT:
> > 




Re: [EXTERNAL] [PATCH 2/2] target/ppc: Fix ISA v3.0 (POWER9) slbia implementation

2020-03-18 Thread Benjamin Herrenschmidt
On Wed, 2020-03-18 at 18:08 +0100, Cédric Le Goater wrote:
> On 3/18/20 5:41 AM, Nicholas Piggin wrote:
> > Linux using the hash MMU ("disable_radix" command line) on a POWER9
> > machine quickly hits translation bugs due to using v3.0 slbia
> > features that are not implemented in TCG. Add them.
> 
> I checked the ISA books and this looks OK but you are also modifying
> slbie.

For the same reason, I believe slbie needs to invalidate caches even if
the entry isn't present.

The kernel will under some circumstances overwrite SLB entries without
invalidating (because the translation itself isn't invalid, it's just
that the SLB is full, so anything cached in the ERAT is still
technically ok).

However, when those things get really invalidated, they need to be
taken out, even if they no longer have a corresponding SLB entry.

Cheers,
Ben.

> Thanks,
> 
> C. 
> 
> 
> > Signed-off-by: Nicholas Piggin 
> > ---
> >  target/ppc/helper.h |  2 +-
> >  target/ppc/mmu-hash64.c | 57 -
> > 
> >  target/ppc/translate.c  |  5 +++-
> >  3 files changed, 55 insertions(+), 9 deletions(-)
> > 
> > diff --git a/target/ppc/helper.h b/target/ppc/helper.h
> > index ee1498050d..2dfa1c6942 100644
> > --- a/target/ppc/helper.h
> > +++ b/target/ppc/helper.h
> > @@ -615,7 +615,7 @@ DEF_HELPER_FLAGS_3(store_slb, TCG_CALL_NO_RWG,
> > void, env, tl, tl)
> >  DEF_HELPER_2(load_slb_esid, tl, env, tl)
> >  DEF_HELPER_2(load_slb_vsid, tl, env, tl)
> >  DEF_HELPER_2(find_slb_vsid, tl, env, tl)
> > -DEF_HELPER_FLAGS_1(slbia, TCG_CALL_NO_RWG, void, env)
> > +DEF_HELPER_FLAGS_2(slbia, TCG_CALL_NO_RWG, void, env, i32)
> >  DEF_HELPER_FLAGS_2(slbie, TCG_CALL_NO_RWG, void, env, tl)
> >  DEF_HELPER_FLAGS_2(slbieg, TCG_CALL_NO_RWG, void, env, tl)
> >  #endif
> > diff --git a/target/ppc/mmu-hash64.c b/target/ppc/mmu-hash64.c
> > index 373d44de74..deb1c13a66 100644
> > --- a/target/ppc/mmu-hash64.c
> > +++ b/target/ppc/mmu-hash64.c
> > @@ -95,9 +95,10 @@ void dump_slb(PowerPCCPU *cpu)
> >  }
> >  }
> > 
> > -void helper_slbia(CPUPPCState *env)
> > +void helper_slbia(CPUPPCState *env, uint32_t ih)
> >  {
> >  PowerPCCPU *cpu = env_archcpu(env);
> > +int starting_entry;
> >  int n;
> > 
> >  /*
> > @@ -111,18 +112,59 @@ void helper_slbia(CPUPPCState *env)
> >   * expected that slbmte is more common than slbia, and slbia
> > is usually
> >   * going to evict valid SLB entries, so that tradeoff is
> > unlikely to be a
> >   * good one.
> > + *
> > + * ISA v2.05 introduced IH field with values 0,1,2,6. These
> > all invalidate
> > + * the same SLB entries (everything but entry 0), but differ
> > in what
> > + * "lookaside information" is invalidated. TCG can ignore this
> > and flush
> > + * everything.
> > + *
> > + * ISA v3.0 introduced additional values 3,4,7, which change
> > what SLBs are
> > + * invalidated.
> >   */
> > 
> > -/* XXX: Warning: slbia never invalidates the first segment */
> > -for (n = 1; n < cpu->hash64_opts->slb_size; n++) {
> > -ppc_slb_t *slb = >slb[n];
> > +env->tlb_need_flush |= TLB_NEED_LOCAL_FLUSH;
> > +
> > +starting_entry = 1; /* default for IH=0,1,2,6 */
> > +
> > +if (env->mmu_model == POWERPC_MMU_3_00) {
> > +switch (ih) {
> > +case 0x7:
> > +/* invalidate no SLBs, but all lookaside information
> > */
> > +return;
> > 
> > -if (slb->esid & SLB_ESID_V) {
> > -slb->esid &= ~SLB_ESID_V;
> > +case 0x3:
> > +case 0x4:
> > +/* also considers SLB entry 0 */
> > +starting_entry = 0;
> > +break;
> > +
> > +case 0x5:
> > +/* treat undefined values as ih==0, and warn */
> > +qemu_log_mask(LOG_GUEST_ERROR,
> > +  "slbia undefined IH field %u.\n", ih);
> > +break;
> > +
> > +default:
> > +/* 0,1,2,6 */
> > +break;
> >  }
> >  }
> > 
> > -env->tlb_need_flush |= TLB_NEED_LOCAL_FLUSH;
> > +for (n = starting_entry; n < cpu->hash64_opts->slb_size; n++)
> > {
> > +ppc_slb_t *slb = >slb[n];
> > +
> > +if (!(slb->esid & SLB_ESID_V)) {
> > +continue;
> > +}
> > +if (env->mmu_model == POWERPC_MMU_3_00) {
> > +if (ih == 0x3 && (slb->vsid & SLB_VSID_C) == 0) {
> > +/* preserves entries with a class value of 0 */
> > +continue;
> > +}
> > +}
> > +
> > +slb->esid &= ~SLB_ESID_V;
> > +}
> >  }
> > 
> >  static void __helper_slbie(CPUPPCState *env, target_ulong addr,
> > @@ -136,6 +178,7 @@ static void __helper_slbie(CPUPPCState *env,
> > target_ulong addr,
> >  return;
> >  }
> > 
> > +env->tlb_need_flush |= TLB_NEED_LOCAL_FLUSH;
> >  if (slb->esid & SLB_ESID_V) {
> >  slb->esid &= ~SLB_ESID_V;
> > 
> > diff --git a/target/ppc/translate.c 

Re: Semihosting, arm, riscv, ppc and common code

2020-01-15 Thread Benjamin Herrenschmidt
On Wed, 2020-01-15 at 12:01 +, Alex Bennée wrote:
> 
> > There seem to be some linux-user stuff in there, I'm mostly considering
> > whatever ARM does today but we can certainly extend later.
> 
> Depends on if it is to be used. AFAIK the main users of arm linux user
> are compiler test cases for M-profile CPUs. 

For microwatt I was going to implement HW support as well via JTAG
but the user linux-user bits are less obviously useful to me.

That said, most of that code can trivially be made arch neutral by
replacing the use of the arch specific CPU type with CPUState as
the first argument to most functions. There are only a handful of arch
specific helpers needed from there to extract the op & arg, set the
result etc..

> > The idea is to make sure ARM, RiscV and POWER use the same protocol and
> > code base to make picolibc (and others) life easier. Bug compatible
> > :-)
> 
> Hmm, I'm not so sure. QEMU tries to emulate real HW although I
> appreciate that is a somewhat loose definition once you get to things
> like -M virt and other such SW like abstractions. Is semihosting even
> going to be a thing on real RiscV/Power silicon?

It will be on microwatt once I add support for it. We could probably
make it work on real power9 if the systems give access to the external
debug facilities of the processor as well. I'm no longer involved with
powerpc in a professional capacity but I can ask Anton or Paul to help
there.

Cheers,
Ben.





Re: Semihosting, arm, riscv, ppc and common code

2020-01-15 Thread Benjamin Herrenschmidt
On Thu, 2020-01-16 at 00:02 +0200, Liviu Ionescu wrote:
> > ... they did have the opportunity to do better, and did not.
> 
> I don't know why you present Arm semihosting as a disaster. It is not
> perfect, but it is functional, and common unit tests use only a small
> subset of the calls.
> 
> And there is no 'window of opportunity', if the RISC-V guys will ever
> want to reinvent the wheel and come with an official 'RISC-V
> semihosting' specs, they can do it at any time, and this will have no
> impact on existing devices, everything will continue to work as
> before, only the debuggers/emulators will need to be upgraded.
> 
> But the only immediate effect such a move will have is that software
> efforts in test frameworks will be increased, to support another
> protocol, while the advantages will be minimal.

I agree, which is also why I want to use the same interface for
powerpc, it will simply make life easier for everybody. The calls
aren't perfect but they do work sufficiently well to be useful and I
haven't yet been convinced that it can't be extended if necessary.

Cheers,
Ben.





Re: Semihosting, arm, riscv, ppc and common code

2020-01-15 Thread Benjamin Herrenschmidt
On Wed, 2020-01-15 at 13:32 +, Peter Maydell wrote:
> On Wed, 15 Jan 2020 at 01:17, Benjamin Herrenschmidt
>  wrote:
> > On Tue, 2020-01-14 at 09:59 +, Peter Maydell wrote:
> > > Note that semihosting is not a "here's a handy QEMU feature"
> > > thing. It's an architecture-specific API and ABI, which should
> > > be defined somewhere in a standard external to QEMU.
> > 
> > There is no such standard for powerpc today that I know of.
> 
> So you need to write one down somewhere. You're proposing
> an ABI which will be implemented by multiple implementors
> and used by multiple consumers. That needs a spec, not
> just code. I don't want to see more semihosting implementations
> added to QEMU that don't have a specification written
> down somewhere.

That's ok, I can probably get openpower to put a link to the ARM one
somewhere :-)

> The point about the mistakes is that you can't easily fix
> them by adding extensions, because the core of the biggest
> mistake is "we didn't provide a good way to allow extensions to
> be added and probed for by the user". So we had to implement
> an ugly and hard to implement mechanism instead.
>  New
> architectures could mandate providing the good way from the start
> and avoid the painful-to-implement approach entirely.
> (I think RISCV has already missed this window of opportunity,
> which is a shame.)

It is done and so now we have two architectures using that standard, I
think the value in using the same overweight the value in fixing it but
yes, we should try to agree on a method of extending at least. Is it
really that hard ?

IE. We could add a new call for capabilities that takes a pointer to
a region which we pre-zero before calling in the client and if remains
zero after the call, then the new stuff isn't there for example. That
sort of stuff is easy, or am I missing something ?

Cheers,
Ben.





Re: Semihosting, arm, riscv, ppc and common code

2020-01-14 Thread Benjamin Herrenschmidt
On Tue, 2020-01-14 at 09:59 +, Peter Maydell wrote:
> Note that semihosting is not a "here's a handy QEMU feature"
> thing. It's an architecture-specific API and ABI, which should
> be defined somewhere in a standard external to QEMU.

There is no such standard for powerpc today that I know of.

> You need to start by having a definition for PPC of what
> semihosting is. If you're starting from scratch there, there
> are some important things you should do differently to Arm --
> there is no benefit to repeating the mistakes of API definition
> that we made! Most notably, you want to specify and require
> that any unrecognized semihosting call function fails in a
> clean and detectable way; you also should have a semihosting
> function for "ask for a feature bit mask" so you don't need
> the silly magic-filename approach Arm had to go for. You
> also want to standardize what the errno values are, which Arm
> forgot to do and which makes the errno handling in the spec
> pretty useless.

Keith and I are somewhat of a different mind here. From the perspective
of the user of that API (picolibc is one), it's easier to deal with a
single one and have everybody inherit the same bugs.

Now I understand the point of wanting to fix the mistakes made but I
would suggest we do so by proposing extensions to the existing one to
do so.

> TLDR: don't start by writing code, start by writing the *API/ABI
> spec*.
> I tried to push the RISCV folks in this direction as well.

AFAIK they are still just doing what ARM does for the above reason.

Cheers,
Ben.





Re: Semihosting, arm, riscv, ppc and common code

2020-01-14 Thread Benjamin Herrenschmidt
On Tue, 2020-01-14 at 09:51 +, Alex Bennée wrote:
> > Well, one of the LCA talks wasn't that interesting so I started
> > doing
> > it and am almost done :-)
> > 
> > I'll look at doing something for arm, riscv and ppc and send
> > patches
> > once I get it working.
> 
> Cool. Are you considering linux-user as well or are these only things
> that make sense for system emulation?

There seem to be some linux-user stuff in there, I'm mostly considering
whatever ARM does today but we can certainly extend later.

The idea is to make sure ARM, RiscV and POWER use the same protocol and
code base to make picolibc (and others) life easier. Bug compatible :-)

Cheers,
Ben.





Re: Semihosting, arm, riscv, ppc and common code

2020-01-13 Thread Benjamin Herrenschmidt
On Tue, 2020-01-14 at 09:32 +0200, Liviu Ionescu wrote:
> > On 14 Jan 2020, at 08:25, Benjamin Herrenschmidt <
> > b...@kernel.crashing.org> wrote:
> > 
> > I noticed that the bulk of arm-semi.c (or riscv-semi.c) is
> > trivially
> > made completely generic by doing a couple of changes:
> 
> Last year I did a similar exercise in OpenOCD, where I took the Arm
> semihosting code from the Arm specific location, extracted the common
> part, updated it for latest Arm 64-bit specs, and then used the
> common code for both Arm and RISC-V.
> 
> If you think useful, you might take a look there too.

Well, one of the LCA talks wasn't that interesting so I started doing
it and am almost done :-)

I'll look at doing something for arm, riscv and ppc and send patches
once I get it working.

Cheers,
Ben.

> 
> Regards,
> 
> Liviu




Semihosting, arm, riscv, ppc and common code

2020-01-13 Thread Benjamin Herrenschmidt
Hi Folks !

So I started "porting" over (read: copying) the arm semihosting code to
ppc to mimmic what Keith did for risv (mostly for picolibc support).

I noticed that the bulk of arm-semi.c (or riscv-semi.c) is trivially
made completely generic by doing a couple of changes:

 - Make most functions take a CPUState instead of the architecture
specific "env"

 - Provide arch helpers to retreive the op, set the result, do the
flen bug hack and possibly a couple of others (I'm not done yet).

There are other archs who do semihosting completely differently but at
least those 3 can share code.

Any objection/comment on the approach ? What I'll probably do is get
things going first with my ppc version (which I made more/less generic
but still located in target/ppc) at which point I can post an RFC so
you get an idea, and we find a good location for it.

>From thre we might consider fixing some of the worst crap in there
doing backwards compatible extensions if we care enough :-)

Cheers,
Ben.





Re: [Qemu-devel] [PATCH 5/6] memory_ldst: Add atomic ops for PTE updates

2019-04-14 Thread Benjamin Herrenschmidt
On Mon, 2019-04-15 at 13:38 +1000, David Gibson wrote:
> On Thu, Apr 11, 2019 at 10:00:03AM +0200, Cédric Le Goater wrote:
> > From: Benjamin Herrenschmidt 
> > 
> > On some architectures, PTE updates for dirty and changed bits need
> > to be performed atomically. This adds a couple of
> > address_space_cmpxchg*
> > helpers for that purpose.
> > 
> > Signed-off-by: Benjamin Herrenschmidt 
> > Signed-off-by: Cédric Le Goater 
> 
> Reviewed-by: David Gibson 
> 
> But I think this needs to go past Paolo for review as memory
> subsystem maintainer.

This needs to go to rth and Peter. They were talking about a change in
the abstraction to do these more cleanly... that said, it might be
worthwhile merging the fixes first.

Cheers,
Ben.

> 
> > ---
> >  include/exec/memory_ldst.inc.h |  6 +++
> >  memory_ldst.inc.c  | 80
> > ++
> >  2 files changed, 86 insertions(+)
> > 
> > diff --git a/include/exec/memory_ldst.inc.h
> > b/include/exec/memory_ldst.inc.h
> > index 272c20f02eae..f3cfa7e9a622 100644
> > --- a/include/exec/memory_ldst.inc.h
> > +++ b/include/exec/memory_ldst.inc.h
> > @@ -28,6 +28,12 @@ extern uint64_t glue(address_space_ldq,
> > SUFFIX)(ARG1_DECL,
> >  hwaddr addr, MemTxAttrs attrs, MemTxResult *result);
> >  extern void glue(address_space_stl_notdirty, SUFFIX)(ARG1_DECL,
> >  hwaddr addr, uint32_t val, MemTxAttrs attrs, MemTxResult
> > *result);
> > +extern uint32_t glue(address_space_cmpxchgl_notdirty,
> > SUFFIX)(ARG1_DECL,
> > +hwaddr addr, uint32_t old, uint32_t new, MemTxAttrs attrs,
> > +MemTxResult *result);
> > +extern uint32_t glue(address_space_cmpxchgq_notdirty,
> > SUFFIX)(ARG1_DECL,
> > +hwaddr addr, uint64_t old, uint64_t new, MemTxAttrs attrs,
> > +MemTxResult *result);
> >  extern void glue(address_space_stw, SUFFIX)(ARG1_DECL,
> >  hwaddr addr, uint32_t val, MemTxAttrs attrs, MemTxResult
> > *result);
> >  extern void glue(address_space_stl, SUFFIX)(ARG1_DECL,
> > diff --git a/memory_ldst.inc.c b/memory_ldst.inc.c
> > index acf865b900d7..1d58d2fea67d 100644
> > --- a/memory_ldst.inc.c
> > +++ b/memory_ldst.inc.c
> > @@ -320,6 +320,86 @@ void glue(address_space_stl_notdirty,
> > SUFFIX)(ARG1_DECL,
> >  RCU_READ_UNLOCK();
> >  }
> >  
> > +/* This is meant to be used for atomic PTE updates under MT-TCG */
> > +uint32_t glue(address_space_cmpxchgl_notdirty, SUFFIX)(ARG1_DECL,
> > +hwaddr addr, uint32_t old, uint32_t new, MemTxAttrs attrs,
> > +MemTxResult *result)
> > +{
> > +uint8_t *ptr;
> > +MemoryRegion *mr;
> > +hwaddr l = 4;
> > +hwaddr addr1;
> > +MemTxResult r;
> > +uint8_t dirty_log_mask;
> > +
> > +/* Must test result */
> > +assert(result);
> > +
> > +RCU_READ_LOCK();
> > +mr = TRANSLATE(addr, , , true, attrs);
> > +if (l < 4 || !memory_access_is_direct(mr, true)) {
> > +r = MEMTX_ERROR;
> > +} else {
> > +uint32_t orig = old;
> > +
> > +ptr = qemu_map_ram_ptr(mr->ram_block, addr1);
> > +old = atomic_cmpxchg(ptr, orig, new);
> > +
> > +if (old == orig) {
> > +dirty_log_mask = memory_region_get_dirty_log_mask(mr);
> > +dirty_log_mask &= ~(1 << DIRTY_MEMORY_CODE);
> > +cpu_physical_memory_set_dirty_range(memory_region_get_
> > ram_addr(mr) +
> > +addr, 4,
> > dirty_log_mask);
> > +}
> > +r = MEMTX_OK;
> > +}
> > +*result = r;
> > +RCU_READ_UNLOCK();
> > +
> > +return old;
> > +}
> > +
> > +#ifdef CONFIG_ATOMIC64
> > +/* This is meant to be used for atomic PTE updates under MT-TCG */
> > +uint32_t glue(address_space_cmpxchgq_notdirty, SUFFIX)(ARG1_DECL,
> > +hwaddr addr, uint64_t old, uint64_t new, MemTxAttrs attrs,
> > +MemTxResult *result)
> > +{
> > +uint8_t *ptr;
> > +MemoryRegion *mr;
> > +hwaddr l = 8;
> > +hwaddr addr1;
> > +MemTxResult r;
> > +uint8_t dirty_log_mask;
> > +
> > +/* Must test result */
> > +assert(result);
> > +
> > +RCU_READ_LOCK();
> > +mr = TRANSLATE(addr, , , true, attrs);
> > +if (l < 8 || !memory_access_is_direct(mr, true)) {
> > +r = MEMTX_ERROR;
> > +} else {
> > +uint32_t orig = old;
> >

Re: [Qemu-devel] [PATCH 07/19] target/ppc: Make special ORs match x86 pause and don't generate on mttcg

2019-02-12 Thread Benjamin Herrenschmidt
On Tue, 2019-02-12 at 16:59 +1100, David Gibson wrote:
> On Mon, Jan 28, 2019 at 10:46:13AM +0100, Cédric Le Goater wrote:
> > From: Benjamin Herrenschmidt 
> > 
> > There's no point in going out of translation on an SMT OR with
> > mttcg since the backend won't do anything useful such as pausing,
> > it's only useful on traditional TCG to give time to other
> > processors.
> 
> Is it actively harmful in the MTTCG case, or just pointless?

I think it can hurt performance, I don't remember for sure :)

> > Signed-off-by: Benjamin Herrenschmidt 
> > Signed-off-by: Cédric Le Goater 
> > ---
> >  target/ppc/translate.c | 6 --
> >  1 file changed, 4 insertions(+), 2 deletions(-)
> > 
> > diff --git a/target/ppc/translate.c b/target/ppc/translate.c
> > index e169c43643a1..7d40a1fbe6bd 100644
> > --- a/target/ppc/translate.c
> > +++ b/target/ppc/translate.c
> > @@ -1580,7 +1580,7 @@ static void gen_pause(DisasContext *ctx)
> >  tcg_temp_free_i32(t0);
> >  
> >  /* Stop translation, this gives other CPUs a chance to run */
> > -gen_exception_nip(ctx, EXCP_HLT, ctx->base.pc_next);
> > +gen_exception_nip(ctx, EXCP_INTERRUPT, ctx->base.pc_next);
> 
> I don't see how this change relates to the rest.

Yeah not sure anymore :-)

> >  }
> >  #endif /* defined(TARGET_PPC64) */
> >  
> > @@ -1662,7 +1662,9 @@ static void gen_or(DisasContext *ctx)
> >   * than no-op, e.g., miso(rs=26), yield(27), mdoio(29), mdoom(30),
> >   * and all currently undefined.
> >   */
> > -gen_pause(ctx);
> > +if (!mttcg_enabled) {
> > +gen_pause(ctx);
> > +}
> >  #endif
> >  #endif
> >  }




Re: [Qemu-devel] [PATCH 08/19] target/ppc: Fix nip on power management instructions

2019-02-12 Thread Benjamin Herrenschmidt
On Tue, 2019-02-12 at 17:02 +1100, David Gibson wrote:
> On Mon, Jan 28, 2019 at 10:46:14AM +0100, Cédric Le Goater wrote:
> > From: Benjamin Herrenschmidt 
> > 
> > Those instructions currently raise an exception from within
> > the helper. This tends to result in a bogus nip value in
> > the env context (typically the beginning of the TB). Such
> > a helper needs a gen_update_nip() first.
> > 
> > This fixes it with a different approach which is to throw
> > the exception from translate.c instead of the helper using
> > gen_exception_nip() which does the right thing.
> > 
> > Signed-off-by: Benjamin Herrenschmidt 
> > Signed-off-by: Cédric Le Goater 
> > ---
> >  target/ppc/excp_helper.c |  1 -
> >  target/ppc/translate.c   | 12 
> >  2 files changed, 8 insertions(+), 5 deletions(-)
> > 
> > diff --git a/target/ppc/excp_helper.c b/target/ppc/excp_helper.c
> > index 751d759fcc1d..8407e0ade938 100644
> > --- a/target/ppc/excp_helper.c
> > +++ b/target/ppc/excp_helper.c
> > @@ -958,7 +958,6 @@ void helper_pminsn(CPUPPCState *env, powerpc_pm_insn_t 
> > insn)
> >   * but this doesn't seem to be a problem.
> >   */
> >  env->msr |= (1ull << MSR_EE);
> > -raise_exception(env, EXCP_HLT);
> >  }
> >  #endif /* defined(TARGET_PPC64) */
> >  
> > diff --git a/target/ppc/translate.c b/target/ppc/translate.c
> > index 7d40a1fbe6bd..55281a8975e0 100644
> > --- a/target/ppc/translate.c
> > +++ b/target/ppc/translate.c
> > @@ -3571,7 +3571,8 @@ static void gen_doze(DisasContext *ctx)
> >  t = tcg_const_i32(PPC_PM_DOZE);
> >  gen_helper_pminsn(cpu_env, t);
> >  tcg_temp_free_i32(t);
> > -gen_stop_exception(ctx);
> > +/* Stop translation, as the CPU is supposed to sleep from now */
> > +gen_exception_nip(ctx, EXCP_HLT, ctx->base.pc_next);
> 
> IIUC this also changes from EXCP_STOP to EXCP_HLT, is that intentional?

Off the top of my head, it might be to break out of the outer exec
loop, but I don't remember off hand.

> >  #endif /* defined(CONFIG_USER_ONLY) */
> >  }
> >  
> > @@ -3586,7 +3587,8 @@ static void gen_nap(DisasContext *ctx)
> >  t = tcg_const_i32(PPC_PM_NAP);
> >  gen_helper_pminsn(cpu_env, t);
> >  tcg_temp_free_i32(t);
> > -gen_stop_exception(ctx);
> > +/* Stop translation, as the CPU is supposed to sleep from now */
> > +gen_exception_nip(ctx, EXCP_HLT, ctx->base.pc_next);
> >  #endif /* defined(CONFIG_USER_ONLY) */
> >  }
> >  
> > @@ -3606,7 +3608,8 @@ static void gen_sleep(DisasContext *ctx)
> >  t = tcg_const_i32(PPC_PM_SLEEP);
> >  gen_helper_pminsn(cpu_env, t);
> >  tcg_temp_free_i32(t);
> > -gen_stop_exception(ctx);
> > +/* Stop translation, as the CPU is supposed to sleep from now */
> > +gen_exception_nip(ctx, EXCP_HLT, ctx->base.pc_next);
> >  #endif /* defined(CONFIG_USER_ONLY) */
> >  }
> >  
> > @@ -3621,7 +3624,8 @@ static void gen_rvwinkle(DisasContext *ctx)
> >  t = tcg_const_i32(PPC_PM_RVWINKLE);
> >  gen_helper_pminsn(cpu_env, t);
> >  tcg_temp_free_i32(t);
> > -gen_stop_exception(ctx);
> > +/* Stop translation, as the CPU is supposed to sleep from now */
> > +gen_exception_nip(ctx, EXCP_HLT, ctx->base.pc_next);
> >  #endif /* defined(CONFIG_USER_ONLY) */
> >  }
> >  #endif /* #if defined(TARGET_PPC64) */




Re: [Qemu-devel] [PATCH 1/3] memory_ldst: Add atomic ops for PTE updates

2018-12-13 Thread Benjamin Herrenschmidt
On Thu, 2018-12-13 at 21:01 -0600, Richard Henderson wrote:
> On 12/13/18 5:58 PM, Benjamin Herrenschmidt wrote:
> > +#ifdef CONFIG_ATOMIC64
> > +/* This is meant to be used for atomic PTE updates under MT-TCG */
> > +uint32_t glue(address_space_cmpxchgq_notdirty, SUFFIX)(ARG1_DECL,
> > +hwaddr addr, uint64_t old, uint64_t new, MemTxAttrs attrs, MemTxResult 
> > *result)
> > +{
> > +uint8_t *ptr;
> > +MemoryRegion *mr;
> > +hwaddr l = 8;
> > +hwaddr addr1;
> > +MemTxResult r;
> > +uint8_t dirty_log_mask;
> > +
> > +/* Must test result */
> > +assert(result);
> > +
> > +RCU_READ_LOCK();
> > +mr = TRANSLATE(addr, , , true, attrs);
> > +if (l < 8 || !memory_access_is_direct(mr, true)) {
> > +r = MEMTX_ERROR;
> > +} else {
> > +uint32_t orig = old;
> > +
> > +ptr = qemu_map_ram_ptr(mr->ram_block, addr1);
> > +old = atomic_cmpxchg(ptr, orig, new);
> > +
> 
> I think you need atomic_cmpxchg__nocheck here.
> 
> Failure would be with a 32-bit host that supports ATOMIC64.
> E.g. i686.

I'm confused by this and the comments around the definition of
ATOMIC_REG_SIZE :)

So would we have CONFIG_ATOMIC64 in that case and if yes why if all the
atomic_* end up barfing ?

Or rather, why set CONFIG_ATOMIC64 if we ought not to use 64-bit
atomics ?

Also we should probably define ATOMIC_REG_SIZE to 8 for ppc64...

Cheers
Ben.
> 
> r~




Re: [Qemu-devel] [PATCH 2/3] i386: Atomically update PTEs with mttcg

2018-12-13 Thread Benjamin Herrenschmidt
Note to RiscV folks: You may want to adapt your code to do the same as
this, esp. afaik, what you do today is endian-broken if host and guest
endian are different.

Cheers,
Ben. 

On Fri, 2018-12-14 at 10:58 +1100, Benjamin Herrenschmidt wrote:
> Afaik, this isn't well documented (at least it wasn't when I last looked)
> but OSes such as Linux rely on this behaviour:
> 
> The HW updates to the page tables need to be done atomically with the
> checking of the present bit (and other permissions).
> 
> This is what allows Linux to do simple xchg of PTEs with 0 and assume
> the value read has "final" stable dirty and accessed bits (the TLB
> invalidation is deferred).
> 
> Signed-off-by: Benjamin Herrenschmidt 
> ---
>  target/i386/excp_helper.c | 104 +-
>  1 file changed, 80 insertions(+), 24 deletions(-)
> 
> diff --git a/target/i386/excp_helper.c b/target/i386/excp_helper.c
> index 49231f6b69..93fc24c011 100644
> --- a/target/i386/excp_helper.c
> +++ b/target/i386/excp_helper.c
> @@ -157,11 +157,45 @@ int x86_cpu_handle_mmu_fault(CPUState *cs, vaddr addr, 
> int size,
>  
>  #else
>  
> +static inline uint64_t update_entry(CPUState *cs, target_ulong addr,
> +uint64_t orig_entry, uint32_t bits)
> +{
> +uint64_t new_entry = orig_entry | bits;
> +
> +/* Write the updated bottom 32-bits */
> +if (qemu_tcg_mttcg_enabled()) {
> +uint32_t old_le = cpu_to_le32(orig_entry);
> +uint32_t new_le = cpu_to_le32(new_entry);
> +MemTxResult result;
> +uint32_t old_ret;
> +
> +old_ret = address_space_cmpxchgl_notdirty(cs->as, addr,
> +  old_le, new_le,
> +  MEMTXATTRS_UNSPECIFIED,
> +  );
> +if (result == MEMTX_OK) {
> +if (old_ret != old_le && old_ret != new_le) {
> +new_entry = 0;
> +}
> +return new_entry;
> +}
> +
> +/* Do we need to support this case where PTEs aren't in RAM ?
> + *
> + * For now fallback to non-atomic case
> + */
> +}
> +
> +x86_stl_phys_notdirty(cs, addr, new_entry);
> +
> +return new_entry;
> +}
> +
>  static hwaddr get_hphys(CPUState *cs, hwaddr gphys, MMUAccessType 
> access_type,
>  int *prot)
>  {
>  CPUX86State *env = _CPU(cs)->env;
> -uint64_t rsvd_mask = PG_HI_RSVD_MASK;
> +uint64_t rsvd_mask;
>  uint64_t ptep, pte;
>  uint64_t exit_info_1 = 0;
>  target_ulong pde_addr, pte_addr;
> @@ -172,6 +206,8 @@ static hwaddr get_hphys(CPUState *cs, hwaddr gphys, 
> MMUAccessType access_type,
>  return gphys;
>  }
>  
> + restart:
> +rsvd_mask = PG_HI_RSVD_MASK;
>  if (!(env->nested_pg_mode & SVM_NPT_NXE)) {
>  rsvd_mask |= PG_NX_MASK;
>  }
> @@ -198,8 +234,10 @@ static hwaddr get_hphys(CPUState *cs, hwaddr gphys, 
> MMUAccessType access_type,
>  goto do_fault_rsvd;
>  }
>  if (!(pml4e & PG_ACCESSED_MASK)) {
> -pml4e |= PG_ACCESSED_MASK;
> -x86_stl_phys_notdirty(cs, pml4e_addr, pml4e);
> +pml4e = update_entry(cs, pml4e_addr, pml4e, 
> PG_ACCESSED_MASK);
> +if (!pml4e) {
> +goto restart;
> +}
>  }
>  ptep &= pml4e ^ PG_NX_MASK;
>  pdpe_addr = (pml4e & PG_ADDRESS_MASK) +
> @@ -213,8 +251,10 @@ static hwaddr get_hphys(CPUState *cs, hwaddr gphys, 
> MMUAccessType access_type,
>  }
>  ptep &= pdpe ^ PG_NX_MASK;
>  if (!(pdpe & PG_ACCESSED_MASK)) {
> -pdpe |= PG_ACCESSED_MASK;
> -x86_stl_phys_notdirty(cs, pdpe_addr, pdpe);
> +pdpe = update_entry(cs, pdpe_addr, pdpe, PG_ACCESSED_MASK);
> +if (!pdpe) {
> +goto restart;
> +}
>  }
>  if (pdpe & PG_PSE_MASK) {
>  /* 1 GB page */
> @@ -256,8 +296,10 @@ static hwaddr get_hphys(CPUState *cs, hwaddr gphys, 
> MMUAccessType access_type,
>  }
>  /* 4 KB page */
>  if (!(pde & PG_ACCESSED_MASK)) {
> -pde |= PG_ACCESSED_MASK;
> -x86_stl_phys_notdirty(cs, pde_addr, pde);
> +pde = update_entry(cs, pde_addr, pde, PG_ACCESSED_MASK);
> +if (!pde) {
> +goto restart;
> +}
>  

Re: [Qemu-devel] [PATCH 3/3] ppc: Fix radix RC updates

2018-12-13 Thread Benjamin Herrenschmidt
On Fri, 2018-12-14 at 10:58 +1100, Benjamin Herrenschmidt wrote:
> They should be atomic for MTTCG. Note: a real POWER9 core doesn't actually
> implement atomic PTE updates, it always fault for SW to handle it. Only
> the nest MMU (used by some accelerator devices and GPUs) implements
> those HW updates.
> 
> However, the architecture does allow the core to do it, and doing so
> in TCG is faster than letting the guest do it.

Note: ppc hash MMU needs fixes too but of a different nature. I have
some queued up as well, but as-is, they are entangled with other ppc
target fixes, so I'll send them separately via David.

Cheers,
Ben.

> Signed-off-by: Benjamin Herrenschmidt 
> ---
>  target/ppc/cpu.h |  1 +
>  target/ppc/mmu-radix64.c | 70 +---
>  2 files changed, 59 insertions(+), 12 deletions(-)
> 
> diff --git a/target/ppc/cpu.h b/target/ppc/cpu.h
> index ab68abe8a2..afdef2af2f 100644
> --- a/target/ppc/cpu.h
> +++ b/target/ppc/cpu.h
> @@ -493,6 +493,7 @@ struct ppc_slb_t {
>  #define DSISR_AMR0x0020
>  /* Unsupported Radix Tree Configuration */
>  #define DSISR_R_BADCONFIG0x0008
> +#define DSISR_ATOMIC_RC  0x0004
>  
>  /* SRR1 error code fields */
>  
> diff --git a/target/ppc/mmu-radix64.c b/target/ppc/mmu-radix64.c
> index ab76cbc835..dba95aabdc 100644
> --- a/target/ppc/mmu-radix64.c
> +++ b/target/ppc/mmu-radix64.c
> @@ -28,6 +28,15 @@
>  #include "mmu-radix64.h"
>  #include "mmu-book3s-v3.h"
>  
> +static inline bool ppc_radix64_hw_rc_updates(CPUPPCState *env)
> +{
> +#ifdef CONFIG_ATOMIC64
> +return true;
> +#else
> +return !qemu_tcg_mttcg_enabled();
> +#endif
> +}
> +
>  static bool ppc_radix64_get_fully_qualified_addr(CPUPPCState *env, vaddr 
> eaddr,
>   uint64_t *lpid, uint64_t 
> *pid)
>  {
> @@ -120,11 +129,18 @@ static bool ppc_radix64_check_prot(PowerPCCPU *cpu, int 
> rwx, uint64_t pte,
>  return true;
>  }
>  
> +/* Check RC bits if necessary */
> +if (!ppc_radix64_hw_rc_updates(env)) {
> +if (!(pte & R_PTE_R) || ((rwx == 1) && !(pte & R_PTE_C))) {
> +*fault_cause |= DSISR_ATOMIC_RC;
> +return true;
> +}
> +}
> +
>  return false;
>  }
>  
> -static void ppc_radix64_set_rc(PowerPCCPU *cpu, int rwx, uint64_t pte,
> -   hwaddr pte_addr, int *prot)
> +static uint64_t ppc_radix64_set_rc(PowerPCCPU *cpu, int rwx, uint64_t pte, 
> hwaddr pte_addr)
>  {
>  CPUState *cs = CPU(cpu);
>  uint64_t npte;
> @@ -133,17 +149,38 @@ static void ppc_radix64_set_rc(PowerPCCPU *cpu, int 
> rwx, uint64_t pte,
>  
>  if (rwx == 1) { /* Store/Write */
>  npte |= R_PTE_C; /* Set change bit */
> -} else {
> -/*
> - * Treat the page as read-only for now, so that a later write
> - * will pass through this function again to set the C bit.
> - */
> -*prot &= ~PAGE_WRITE;
>  }
> +if (pte == npte) {
> +return pte;
> +}
> +
> +#ifdef CONFIG_ATOMIC64
> +if (qemu_tcg_mttcg_enabled()) {
> +uint64_t old_be = cpu_to_be32(pte);
> +uint64_t new_be = cpu_to_be32(npte);
> +MemTxResult result;
> +uint64_t old_ret;
> +
> +old_ret = address_space_cmpxchgq_notdirty(cs->as, pte_addr,
> +  old_be, new_be,
> +  MEMTXATTRS_UNSPECIFIED,
> +  );
> +if (result == MEMTX_OK) {
> +if (old_ret != old_be && old_ret != new_be) {
> +return 0;
> +}
> +return npte;
> +}
>  
> -if (pte ^ npte) { /* If pte has changed then write it back */
> -stq_phys(cs->as, pte_addr, npte);
> +/* Do we need to support this case where PTEs aren't in RAM ?
> + *
> + * For now fallback to non-atomic case
> + */
>  }
> +#endif
> +
> +stq_phys(cs->as, pte_addr, npte);
> +return npte;
>  }
>  
>  static uint64_t ppc_radix64_walk_tree(PowerPCCPU *cpu, vaddr eaddr,
> @@ -234,6 +271,7 @@ int ppc_radix64_handle_mmu_fault(PowerPCCPU *cpu, vaddr 
> eaddr, int rwx,
>  
>  /* Walk Radix Tree from Process Table Entry to Convert EA to RA */
>  page_size = PRTBE_R_GET_RTS(prtbe0);
> + restart:
>  pte = ppc_radix64_walk_tree(cpu, eaddr & R_EADDR_MASK,
>  prtbe0 & PRTBE_R_RPDB, prtb

[Qemu-devel] [PATCH 3/3] ppc: Fix radix RC updates

2018-12-13 Thread Benjamin Herrenschmidt
They should be atomic for MTTCG. Note: a real POWER9 core doesn't actually
implement atomic PTE updates, it always fault for SW to handle it. Only
the nest MMU (used by some accelerator devices and GPUs) implements
those HW updates.

However, the architecture does allow the core to do it, and doing so
in TCG is faster than letting the guest do it.

Signed-off-by: Benjamin Herrenschmidt 
---
 target/ppc/cpu.h |  1 +
 target/ppc/mmu-radix64.c | 70 +---
 2 files changed, 59 insertions(+), 12 deletions(-)

diff --git a/target/ppc/cpu.h b/target/ppc/cpu.h
index ab68abe8a2..afdef2af2f 100644
--- a/target/ppc/cpu.h
+++ b/target/ppc/cpu.h
@@ -493,6 +493,7 @@ struct ppc_slb_t {
 #define DSISR_AMR0x0020
 /* Unsupported Radix Tree Configuration */
 #define DSISR_R_BADCONFIG0x0008
+#define DSISR_ATOMIC_RC  0x0004
 
 /* SRR1 error code fields */
 
diff --git a/target/ppc/mmu-radix64.c b/target/ppc/mmu-radix64.c
index ab76cbc835..dba95aabdc 100644
--- a/target/ppc/mmu-radix64.c
+++ b/target/ppc/mmu-radix64.c
@@ -28,6 +28,15 @@
 #include "mmu-radix64.h"
 #include "mmu-book3s-v3.h"
 
+static inline bool ppc_radix64_hw_rc_updates(CPUPPCState *env)
+{
+#ifdef CONFIG_ATOMIC64
+return true;
+#else
+return !qemu_tcg_mttcg_enabled();
+#endif
+}
+
 static bool ppc_radix64_get_fully_qualified_addr(CPUPPCState *env, vaddr eaddr,
  uint64_t *lpid, uint64_t *pid)
 {
@@ -120,11 +129,18 @@ static bool ppc_radix64_check_prot(PowerPCCPU *cpu, int 
rwx, uint64_t pte,
 return true;
 }
 
+/* Check RC bits if necessary */
+if (!ppc_radix64_hw_rc_updates(env)) {
+if (!(pte & R_PTE_R) || ((rwx == 1) && !(pte & R_PTE_C))) {
+*fault_cause |= DSISR_ATOMIC_RC;
+return true;
+}
+}
+
 return false;
 }
 
-static void ppc_radix64_set_rc(PowerPCCPU *cpu, int rwx, uint64_t pte,
-   hwaddr pte_addr, int *prot)
+static uint64_t ppc_radix64_set_rc(PowerPCCPU *cpu, int rwx, uint64_t pte, 
hwaddr pte_addr)
 {
 CPUState *cs = CPU(cpu);
 uint64_t npte;
@@ -133,17 +149,38 @@ static void ppc_radix64_set_rc(PowerPCCPU *cpu, int rwx, 
uint64_t pte,
 
 if (rwx == 1) { /* Store/Write */
 npte |= R_PTE_C; /* Set change bit */
-} else {
-/*
- * Treat the page as read-only for now, so that a later write
- * will pass through this function again to set the C bit.
- */
-*prot &= ~PAGE_WRITE;
 }
+if (pte == npte) {
+return pte;
+}
+
+#ifdef CONFIG_ATOMIC64
+if (qemu_tcg_mttcg_enabled()) {
+uint64_t old_be = cpu_to_be32(pte);
+uint64_t new_be = cpu_to_be32(npte);
+MemTxResult result;
+uint64_t old_ret;
+
+old_ret = address_space_cmpxchgq_notdirty(cs->as, pte_addr,
+  old_be, new_be,
+  MEMTXATTRS_UNSPECIFIED,
+  );
+if (result == MEMTX_OK) {
+if (old_ret != old_be && old_ret != new_be) {
+return 0;
+}
+return npte;
+}
 
-if (pte ^ npte) { /* If pte has changed then write it back */
-stq_phys(cs->as, pte_addr, npte);
+/* Do we need to support this case where PTEs aren't in RAM ?
+ *
+ * For now fallback to non-atomic case
+ */
 }
+#endif
+
+stq_phys(cs->as, pte_addr, npte);
+return npte;
 }
 
 static uint64_t ppc_radix64_walk_tree(PowerPCCPU *cpu, vaddr eaddr,
@@ -234,6 +271,7 @@ int ppc_radix64_handle_mmu_fault(PowerPCCPU *cpu, vaddr 
eaddr, int rwx,
 
 /* Walk Radix Tree from Process Table Entry to Convert EA to RA */
 page_size = PRTBE_R_GET_RTS(prtbe0);
+ restart:
 pte = ppc_radix64_walk_tree(cpu, eaddr & R_EADDR_MASK,
 prtbe0 & PRTBE_R_RPDB, prtbe0 & PRTBE_R_RPDS,
 , _size, _cause, _addr);
@@ -244,8 +282,16 @@ int ppc_radix64_handle_mmu_fault(PowerPCCPU *cpu, vaddr 
eaddr, int rwx,
 }
 
 /* Update Reference and Change Bits */
-ppc_radix64_set_rc(cpu, rwx, pte, pte_addr, );
-
+if (ppc_radix64_hw_rc_updates(env)) {
+pte = ppc_radix64_set_rc(cpu, rwx, pte, pte_addr);
+if (!pte) {
+goto restart;
+}
+}
+/* If the page doesn't have C, treat it as read only */
+if (!(pte & R_PTE_C)) {
+prot &= ~PAGE_WRITE;
+}
 tlb_set_page(cs, eaddr & TARGET_PAGE_MASK, raddr & TARGET_PAGE_MASK,
  prot, mmu_idx, 1UL << page_size);
 return 0;
-- 
2.19.2




[Qemu-devel] [PATCH 1/3] memory_ldst: Add atomic ops for PTE updates

2018-12-13 Thread Benjamin Herrenschmidt
On some architectures, PTE updates for dirty and changed bits need
to be performed atomically. This adds a couple of address_space_cmpxchg*
helpers for that purpose.

Signed-off-by: Benjamin Herrenschmidt 
---
 include/exec/memory_ldst.inc.h |  6 +++
 memory_ldst.inc.c  | 78 ++
 2 files changed, 84 insertions(+)

diff --git a/include/exec/memory_ldst.inc.h b/include/exec/memory_ldst.inc.h
index 272c20f02e..f3cfa7e9a6 100644
--- a/include/exec/memory_ldst.inc.h
+++ b/include/exec/memory_ldst.inc.h
@@ -28,6 +28,12 @@ extern uint64_t glue(address_space_ldq, SUFFIX)(ARG1_DECL,
 hwaddr addr, MemTxAttrs attrs, MemTxResult *result);
 extern void glue(address_space_stl_notdirty, SUFFIX)(ARG1_DECL,
 hwaddr addr, uint32_t val, MemTxAttrs attrs, MemTxResult *result);
+extern uint32_t glue(address_space_cmpxchgl_notdirty, SUFFIX)(ARG1_DECL,
+hwaddr addr, uint32_t old, uint32_t new, MemTxAttrs attrs,
+MemTxResult *result);
+extern uint32_t glue(address_space_cmpxchgq_notdirty, SUFFIX)(ARG1_DECL,
+hwaddr addr, uint64_t old, uint64_t new, MemTxAttrs attrs,
+MemTxResult *result);
 extern void glue(address_space_stw, SUFFIX)(ARG1_DECL,
 hwaddr addr, uint32_t val, MemTxAttrs attrs, MemTxResult *result);
 extern void glue(address_space_stl, SUFFIX)(ARG1_DECL,
diff --git a/memory_ldst.inc.c b/memory_ldst.inc.c
index acf865b900..7ab6de37ba 100644
--- a/memory_ldst.inc.c
+++ b/memory_ldst.inc.c
@@ -320,6 +320,84 @@ void glue(address_space_stl_notdirty, SUFFIX)(ARG1_DECL,
 RCU_READ_UNLOCK();
 }
 
+/* This is meant to be used for atomic PTE updates under MT-TCG */
+uint32_t glue(address_space_cmpxchgl_notdirty, SUFFIX)(ARG1_DECL,
+hwaddr addr, uint32_t old, uint32_t new, MemTxAttrs attrs, MemTxResult 
*result)
+{
+uint8_t *ptr;
+MemoryRegion *mr;
+hwaddr l = 4;
+hwaddr addr1;
+MemTxResult r;
+uint8_t dirty_log_mask;
+
+/* Must test result */
+assert(result);
+
+RCU_READ_LOCK();
+mr = TRANSLATE(addr, , , true, attrs);
+if (l < 4 || !memory_access_is_direct(mr, true)) {
+r = MEMTX_ERROR;
+} else {
+uint32_t orig = old;
+
+ptr = qemu_map_ram_ptr(mr->ram_block, addr1);
+old = atomic_cmpxchg(ptr, orig, new);
+
+if (old == orig) {
+dirty_log_mask = memory_region_get_dirty_log_mask(mr);
+dirty_log_mask &= ~(1 << DIRTY_MEMORY_CODE);
+cpu_physical_memory_set_dirty_range(memory_region_get_ram_addr(mr) 
+ addr,
+4, dirty_log_mask);
+}
+r = MEMTX_OK;
+}
+*result = r;
+RCU_READ_UNLOCK();
+
+return old;
+}
+
+#ifdef CONFIG_ATOMIC64
+/* This is meant to be used for atomic PTE updates under MT-TCG */
+uint32_t glue(address_space_cmpxchgq_notdirty, SUFFIX)(ARG1_DECL,
+hwaddr addr, uint64_t old, uint64_t new, MemTxAttrs attrs, MemTxResult 
*result)
+{
+uint8_t *ptr;
+MemoryRegion *mr;
+hwaddr l = 8;
+hwaddr addr1;
+MemTxResult r;
+uint8_t dirty_log_mask;
+
+/* Must test result */
+assert(result);
+
+RCU_READ_LOCK();
+mr = TRANSLATE(addr, , , true, attrs);
+if (l < 8 || !memory_access_is_direct(mr, true)) {
+r = MEMTX_ERROR;
+} else {
+uint32_t orig = old;
+
+ptr = qemu_map_ram_ptr(mr->ram_block, addr1);
+old = atomic_cmpxchg(ptr, orig, new);
+
+if (old == orig) {
+dirty_log_mask = memory_region_get_dirty_log_mask(mr);
+dirty_log_mask &= ~(1 << DIRTY_MEMORY_CODE);
+cpu_physical_memory_set_dirty_range(memory_region_get_ram_addr(mr) 
+ addr,
+8, dirty_log_mask);
+}
+r = MEMTX_OK;
+}
+*result = r;
+RCU_READ_UNLOCK();
+
+return old;
+}
+#endif /* CONFIG_ATOMIC64 */
+
 /* warning: addr must be aligned */
 static inline void glue(address_space_stl_internal, SUFFIX)(ARG1_DECL,
 hwaddr addr, uint32_t val, MemTxAttrs attrs,
-- 
2.19.2




[Qemu-devel] [PATCH 2/3] i386: Atomically update PTEs with mttcg

2018-12-13 Thread Benjamin Herrenschmidt
Afaik, this isn't well documented (at least it wasn't when I last looked)
but OSes such as Linux rely on this behaviour:

The HW updates to the page tables need to be done atomically with the
checking of the present bit (and other permissions).

This is what allows Linux to do simple xchg of PTEs with 0 and assume
the value read has "final" stable dirty and accessed bits (the TLB
invalidation is deferred).

Signed-off-by: Benjamin Herrenschmidt 
---
 target/i386/excp_helper.c | 104 +-
 1 file changed, 80 insertions(+), 24 deletions(-)

diff --git a/target/i386/excp_helper.c b/target/i386/excp_helper.c
index 49231f6b69..93fc24c011 100644
--- a/target/i386/excp_helper.c
+++ b/target/i386/excp_helper.c
@@ -157,11 +157,45 @@ int x86_cpu_handle_mmu_fault(CPUState *cs, vaddr addr, 
int size,
 
 #else
 
+static inline uint64_t update_entry(CPUState *cs, target_ulong addr,
+uint64_t orig_entry, uint32_t bits)
+{
+uint64_t new_entry = orig_entry | bits;
+
+/* Write the updated bottom 32-bits */
+if (qemu_tcg_mttcg_enabled()) {
+uint32_t old_le = cpu_to_le32(orig_entry);
+uint32_t new_le = cpu_to_le32(new_entry);
+MemTxResult result;
+uint32_t old_ret;
+
+old_ret = address_space_cmpxchgl_notdirty(cs->as, addr,
+  old_le, new_le,
+  MEMTXATTRS_UNSPECIFIED,
+  );
+if (result == MEMTX_OK) {
+if (old_ret != old_le && old_ret != new_le) {
+new_entry = 0;
+}
+return new_entry;
+}
+
+/* Do we need to support this case where PTEs aren't in RAM ?
+ *
+ * For now fallback to non-atomic case
+ */
+}
+
+x86_stl_phys_notdirty(cs, addr, new_entry);
+
+return new_entry;
+}
+
 static hwaddr get_hphys(CPUState *cs, hwaddr gphys, MMUAccessType access_type,
 int *prot)
 {
 CPUX86State *env = _CPU(cs)->env;
-uint64_t rsvd_mask = PG_HI_RSVD_MASK;
+uint64_t rsvd_mask;
 uint64_t ptep, pte;
 uint64_t exit_info_1 = 0;
 target_ulong pde_addr, pte_addr;
@@ -172,6 +206,8 @@ static hwaddr get_hphys(CPUState *cs, hwaddr gphys, 
MMUAccessType access_type,
 return gphys;
 }
 
+ restart:
+rsvd_mask = PG_HI_RSVD_MASK;
 if (!(env->nested_pg_mode & SVM_NPT_NXE)) {
 rsvd_mask |= PG_NX_MASK;
 }
@@ -198,8 +234,10 @@ static hwaddr get_hphys(CPUState *cs, hwaddr gphys, 
MMUAccessType access_type,
 goto do_fault_rsvd;
 }
 if (!(pml4e & PG_ACCESSED_MASK)) {
-pml4e |= PG_ACCESSED_MASK;
-x86_stl_phys_notdirty(cs, pml4e_addr, pml4e);
+pml4e = update_entry(cs, pml4e_addr, pml4e, PG_ACCESSED_MASK);
+if (!pml4e) {
+goto restart;
+}
 }
 ptep &= pml4e ^ PG_NX_MASK;
 pdpe_addr = (pml4e & PG_ADDRESS_MASK) +
@@ -213,8 +251,10 @@ static hwaddr get_hphys(CPUState *cs, hwaddr gphys, 
MMUAccessType access_type,
 }
 ptep &= pdpe ^ PG_NX_MASK;
 if (!(pdpe & PG_ACCESSED_MASK)) {
-pdpe |= PG_ACCESSED_MASK;
-x86_stl_phys_notdirty(cs, pdpe_addr, pdpe);
+pdpe = update_entry(cs, pdpe_addr, pdpe, PG_ACCESSED_MASK);
+if (!pdpe) {
+goto restart;
+}
 }
 if (pdpe & PG_PSE_MASK) {
 /* 1 GB page */
@@ -256,8 +296,10 @@ static hwaddr get_hphys(CPUState *cs, hwaddr gphys, 
MMUAccessType access_type,
 }
 /* 4 KB page */
 if (!(pde & PG_ACCESSED_MASK)) {
-pde |= PG_ACCESSED_MASK;
-x86_stl_phys_notdirty(cs, pde_addr, pde);
+pde = update_entry(cs, pde_addr, pde, PG_ACCESSED_MASK);
+if (!pde) {
+goto restart;
+}
 }
 pte_addr = (pde & PG_ADDRESS_MASK) + (((gphys >> 12) & 0x1ff) << 3);
 pte = x86_ldq_phys(cs, pte_addr);
@@ -295,8 +337,10 @@ static hwaddr get_hphys(CPUState *cs, hwaddr gphys, 
MMUAccessType access_type,
 }
 
 if (!(pde & PG_ACCESSED_MASK)) {
-pde |= PG_ACCESSED_MASK;
-x86_stl_phys_notdirty(cs, pde_addr, pde);
+pde = update_entry(cs, pde_addr, pde, PG_ACCESSED_MASK);
+if (!pde) {
+goto restart;
+}
 }
 
 /* page directory entry */
@@ -376,7 +420,7 @@ int x86_cpu_handle_mmu_fault(CPUState *cs, vaddr addr, int 
size,
 int error_code = 0;
 int is_dirty, prot, page_size, is_write, is_user;
 hwaddr paddr;
-uint64_t rsvd_mask = PG_HI_RSVD_MASK;
+uint64_t rsvd_

Re: [Qemu-devel] [RFC/PATCH] i386: Atomically update PTEs with mttcg

2018-12-13 Thread Benjamin Herrenschmidt
On Thu, 2018-11-29 at 11:04 -0800, Richard Henderson wrote:
> 
> While I know the existing code just updates the low 32-bits, I'd rather update
> the whole entry, and make use of the old value returned from the cmpxchg.

So I had to put this on the back burner for a bit, but I'm back to it
now.

I've tried the above but it gets messy again. The i386 code can have
either 32 or 64 bits PDE/PTEs. The udpate code for the PTE is designed
so a single code path can work with either by exploiting that LE trick,
so changing that would involve more re-factoring and I'd rather avoid
changing more than strictly needed in there.

As for this:

>pdpe = x86_ldq_phys(cs, pdpe_addr);
> do {
> if (!(pdpe & PG_PRESENT_MASK)) {
> goto do_fault;
> }
> if (pdpe & rsvd_mask) {
> goto do_fault_rsvd;
> }
> if (pdpe & PG_ACCESSED_MASK) {
> break;
> }
> } while (!update_entry(cs, pdpe_addr, , PG_ACCESSED_MASK));
> ptep &= pdpe ^ PG_NX_MASK;

While that form of construct is nicer than the current goto restart in
my patch, in a similar way, it works for the intermediary cases but
would require some serious refactoring of the whole function for the
final PTE case.

At this point I'd rather not take chances and introduce more bugs, so
I'll post an updated variant with a few fixes but will postpone more
refactoring unless you really really want them now :)

Cheers,
Ben.




Re: [Qemu-devel] [PATCH v7 17/19] spapr: Add a pseries-4.0 machine type

2018-12-09 Thread Benjamin Herrenschmidt
On Sun, 2018-12-09 at 20:46 +0100, Cédric Le Goater wrote:
> Signed-off-by: Cédric Le Goater 
> ---

If you're going to do that, can we include large decrementer in there
too ? (patches from Suraj in my tree but they night need a bit of
massaging).

>  include/hw/compat.h |  3 +++
>  hw/ppc/spapr.c  | 25 ++---
>  2 files changed, 25 insertions(+), 3 deletions(-)
> 
> diff --git a/include/hw/compat.h b/include/hw/compat.h
> index 6f4d5fc64704..70958328fe7a 100644
> --- a/include/hw/compat.h
> +++ b/include/hw/compat.h
> @@ -1,6 +1,9 @@
>  #ifndef HW_COMPAT_H
>  #define HW_COMPAT_H
>  
> +#define HW_COMPAT_3_1 \
> +/* empty */
> +
>  #define HW_COMPAT_3_0 \
>  /* empty */
>  
> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> index fa41927d95dd..4012ebd794a4 100644
> --- a/hw/ppc/spapr.c
> +++ b/hw/ppc/spapr.c
> @@ -3971,19 +3971,38 @@ static const TypeInfo spapr_machine_info = {
>  }\
>  type_init(spapr_machine_register_##suffix)
>  
> - /*
> +/*
> + * pseries-4.0
> + */
> +static void spapr_machine_4_0_instance_options(MachineState *machine)
> +{
> +}
> +
> +static void spapr_machine_4_0_class_options(MachineClass *mc)
> +{
> +/* Defaults for the latest behaviour inherited from the base class */
> +}
> +
> +DEFINE_SPAPR_MACHINE(4_0, "4.0", true);
> +
> +/*
>   * pseries-3.1
>   */
> +#define SPAPR_COMPAT_3_1  \
> +HW_COMPAT_3_1
> +
>  static void spapr_machine_3_1_instance_options(MachineState *machine)
>  {
> +spapr_machine_4_0_instance_options(machine);
>  }
>  
>  static void spapr_machine_3_1_class_options(MachineClass *mc)
>  {
> -/* Defaults for the latest behaviour inherited from the base class */
> +spapr_machine_4_0_class_options(mc);
> +SET_MACHINE_COMPAT(mc, SPAPR_COMPAT_3_1);
>  }
>  
> -DEFINE_SPAPR_MACHINE(3_1, "3.1", true);
> +DEFINE_SPAPR_MACHINE(3_1, "3.1", false);
>  
>  /*
>   * pseries-3.0




Re: [Qemu-devel] [RFC/PATCH] i386: Atomically update PTEs with mttcg

2018-11-29 Thread Benjamin Herrenschmidt
On Thu, 2018-11-29 at 16:12 -0800, Richard Henderson wrote:
> > You mean translating once for the load and for the udpate ? Do you
> > expect that translation to have such a significant cost considering
> > that all it needs should be in L1 at that point ?
> 
> I guess if the update is rare-ish, the re-translating isn't a big deal.  And I
> suppose we'd have to retain the RCU lock to hold on to the translation, which
> probably isn't the best idea.

Yeay I toyed with this a bit but it ended up very messy real quick.

Cheers,
Ben.




Re: [Qemu-devel] [RFC/PATCH] i386: Atomically update PTEs with mttcg

2018-11-29 Thread Benjamin Herrenschmidt
On Thu, 2018-11-29 at 16:12 -0800, Richard Henderson wrote:
> On 11/29/18 2:54 PM, Benjamin Herrenschmidt wrote:
> > >  pdpe_addr = (pml4e & PG_ADDRESS_MASK) +
> > >  (((gphys >> 30) & 0x1ff) << 3);
> > >  pdpe = x86_ldq_phys(cs, pdpe_addr);
> > >  do {
> > >  if (!(pdpe & PG_PRESENT_MASK)) {
> > >  goto do_fault;
> > >  }
> > >  if (pdpe & rsvd_mask) {
> > >  goto do_fault_rsvd;
> > >  }
> > >  if (pdpe & PG_ACCESSED_MASK) {
> > >  break;
> > >  }
> > >  } while (!update_entry(cs, pdpe_addr, , PG_ACCESSED_MASK));
> > >  ptep &= pdpe ^ PG_NX_MASK;
> > > 
> > > 
> > 
> > Hrm.. I see. So not re-do the full walk. Not sure it's really worth it
> > though, how often do we expect to hit the failing case ?
> 
> It is probably rare-ish, I admit.
> 
> I suppose we could also signal "success" from update_entry if the cmpxchg
> fails, but the value that was reloaded only differs in setting 
> PG_ACCESSED_MASK

The latter optimization is trivial. As for the former, replacing my
"goto restart" with those loops, it will make the patch significantly
bigger, though not adding another goto has its perks and the end result
might be slightly nicer ...

What way do you prefer ?

Cheers,
Ben.




Re: [Qemu-devel] [RFC/PATCH] i386: Atomically update PTEs with mttcg

2018-11-29 Thread Benjamin Herrenschmidt
On Thu, 2018-11-29 at 11:04 -0800, Richard Henderson wrote:
> On 11/28/18 3:15 PM, Benjamin Herrenschmidt wrote:
> > +static inline uint64_t update_entry(CPUState *cs, target_ulong addr,
> > +uint64_t orig_entry, uint32_t bits)
> > +{
> > +uint64_t new_entry = orig_entry | bits;
> > +
> > +/* Write the updated bottom 32-bits */
> > +if (qemu_tcg_mttcg_enabled()) {
> > +uint32_t old_le = cpu_to_le32(orig_entry);
> > +uint32_t new_le = cpu_to_le32(new_entry);
> > +MemTxResult result;
> > +uint32_t old_ret;
> > +
> > +old_ret = address_space_cmpxchgl_notdirty(cs->as, addr,
> > +  old_le, new_le,
> > +  MEMTXATTRS_UNSPECIFIED,
> > +  );
> > +if (result == MEMTX_OK) {
> > +if (old_ret != old_le) {
> > +new_entry = 0;
> > +}
> > +return new_entry;
> > +}
> > +
> > +/* Do we need to support this case where PTEs aren't in RAM ?
> > + *
> > + * For now fallback to non-atomic case
> > + */
> > +}
> > +
> > +x86_stl_phys_notdirty(cs, addr, new_entry);
> > +
> > +return new_entry;
> > +}
> 
> While I know the existing code just updates the low 32-bits, I'd rather update
> the whole entry, and make use of the old value returned from the cmpxchg.

Not all PTEs on x86 ae 64-bits, the old 32-bit format still exists, so
we'd have to create two helpers. Not a big deal mind you. I'm already
creating a second address_space_cmpxchgq_notdirty for use by powerpc

> static bool update_entry(CPUState *cs, target_ulong addr,
>  uint64_t *entry, uint32_t bits)
> {
> uint64_t old_entry = *entry;
> uint64_t new_entry = old_entry | bits;
> 
> if (qemu_tcg_mttcg_enabled()) {
> uint64_t cmp_le = cpu_to_le64(old_entry);
> uint64_t new_le = cpu_to_le64(new_entry);
> uint64_t old_le;
> MemTxResult result;
> 
> old_le = address_space_cmpxchgl_notdirty(cs->as, addr,
>  cmp_le, new_le,
>  MEMTXATTRS_UNSPECIFIED,
>  );
> if (result == MEMTX_OK) {
> if (old_le == cmp_le) {
> return true;
> } else {
> *entry = le_to_cpu64(old_le);
> return false;
> }
> }
> }
> x86_stq_phys_notdirty(cs, addr, new_entry);
> return true;
> }


> 
> 
> pdpe_addr = (pml4e & PG_ADDRESS_MASK) +
> (((gphys >> 30) & 0x1ff) << 3);
> pdpe = x86_ldq_phys(cs, pdpe_addr);
> do {
> if (!(pdpe & PG_PRESENT_MASK)) {
> goto do_fault;
> }
> if (pdpe & rsvd_mask) {
> goto do_fault_rsvd;
> }
> if (pdpe & PG_ACCESSED_MASK) {
> break;
> }
> } while (!update_entry(cs, pdpe_addr, , PG_ACCESSED_MASK));
> ptep &= pdpe ^ PG_NX_MASK;
> 
> 

Hrm.. I see. So not re-do the full walk. Not sure it's really worth it
though, how often do we expect to hit the failing case ?

> 
> Although I think it would be really great if we could figure out something 
> that
> allows us to promote this whole load/cmpxchg loop into a primitive that avoids
> multiple translations of the address.
> 
> No, I don't know what that primitive would look like.  :-)

You mean translating once for the load and for the udpate ? Do you
expect that translation to have such a significant cost considering
that all it needs should be in L1 at that point ?

There are subtle differences in what happens between the load and the
update, and we hope for the update to be fairly rare (even though it's
done by HW it's costly even on actual chips).

So best way I could think of would be to use a macro, so we can open-
code the statements that test/manipulate the value.

Not sure I'll get to it today though, I'm off early this afternoon for
the week-end.

Cheers,
Ben.
> 
> r~




Re: [Qemu-devel] [RFC/PATCH] i386: Atomically update PTEs with mttcg

2018-11-29 Thread Benjamin Herrenschmidt
On Thu, 2018-11-29 at 10:26 +0100, Paolo Bonzini wrote:
> On 29/11/18 00:15, Benjamin Herrenschmidt wrote:
> > Afaik, this isn't well documented (at least it wasn't when I last looked)
> > but OSes such as Linux rely on this behaviour:
> > 
> > The HW updates to the page tables need to be done atomically with the
> > checking of the present bit (and other permissions).
> > 
> > This is what allows Linux to do simple xchg of PTEs with 0 and assume
> > the value read has "final" stable dirty and accessed bits (the TLB
> > invalidation is deferred).
> > 
> > Signed-off-by: Benjamin Herrenschmidt 
> > ---
> > 
> > This is lightly tested at this point, mostly to gather comments on the
> > approach.
> 
> Looks good, but please kill the notdirty stuff first.  It's not needed
> and it's a clear bug.  When migrating, it can lead to PTEs being
> migrated without accessed and dirty bits.

I thought that too but looking closely with rth, it seems it's still
setting *those* dirty bits, it just avoids the collision test with the
translator. Unless I missed something...

Do you still want to kill them ? They are warty, no doubt...
 
For powerpc I need a cmpxchgq variant too, I'll probably split the
patch adding those mem ops from the rest as well.

Annoyingly, fixing riscv will need some tests on target_ulong size.

Cheers,
Ben.

> Paolo
> 
> > I noticed that RiscV is attempting to do something similar but with endian
> > bugs, at least from my reading of the code.
> > 
> >  include/exec/memory_ldst.inc.h |   3 +
> >  memory_ldst.inc.c  |  38 
> >  target/i386/excp_helper.c  | 104 +
> >  3 files changed, 121 insertions(+), 24 deletions(-)
> > 
> > diff --git a/include/exec/memory_ldst.inc.h b/include/exec/memory_ldst.inc.h
> > index 272c20f02e..b7a41a0ad4 100644
> > --- a/include/exec/memory_ldst.inc.h
> > +++ b/include/exec/memory_ldst.inc.h
> > @@ -28,6 +28,9 @@ extern uint64_t glue(address_space_ldq, SUFFIX)(ARG1_DECL,
> >  hwaddr addr, MemTxAttrs attrs, MemTxResult *result);
> >  extern void glue(address_space_stl_notdirty, SUFFIX)(ARG1_DECL,
> >  hwaddr addr, uint32_t val, MemTxAttrs attrs, MemTxResult *result);
> > +extern uint32_t glue(address_space_cmpxchgl_notdirty, SUFFIX)(ARG1_DECL,
> > +hwaddr addr, uint32_t old, uint32_t new, MemTxAttrs attrs,
> > +MemTxResult *result);
> >  extern void glue(address_space_stw, SUFFIX)(ARG1_DECL,
> >  hwaddr addr, uint32_t val, MemTxAttrs attrs, MemTxResult *result);
> >  extern void glue(address_space_stl, SUFFIX)(ARG1_DECL,
> > diff --git a/memory_ldst.inc.c b/memory_ldst.inc.c
> > index acf865b900..63af8f7dd2 100644
> > --- a/memory_ldst.inc.c
> > +++ b/memory_ldst.inc.c
> > @@ -320,6 +320,44 @@ void glue(address_space_stl_notdirty, 
> > SUFFIX)(ARG1_DECL,
> >  RCU_READ_UNLOCK();
> >  }
> >  
> > +/* This is meant to be used for atomic PTE updates under MT-TCG */
> > +uint32_t glue(address_space_cmpxchgl_notdirty, SUFFIX)(ARG1_DECL,
> > +hwaddr addr, uint32_t old, uint32_t new, MemTxAttrs attrs, MemTxResult 
> > *result)
> > +{
> > +uint8_t *ptr;
> > +MemoryRegion *mr;
> > +hwaddr l = 4;
> > +hwaddr addr1;
> > +MemTxResult r;
> > +uint8_t dirty_log_mask;
> > +
> > +/* Must test result */
> > +assert(result);
> > +
> > +RCU_READ_LOCK();
> > +mr = TRANSLATE(addr, , , true, attrs);
> > +if (l < 4 || !memory_access_is_direct(mr, true)) {
> > +r = MEMTX_ERROR;
> > +} else {
> > +uint32_t orig = old;
> > +
> > +ptr = qemu_map_ram_ptr(mr->ram_block, addr1);
> > +old = atomic_cmpxchg(ptr, orig, new);
> > +
> > +if (old == orig) {
> > +dirty_log_mask = memory_region_get_dirty_log_mask(mr);
> > +dirty_log_mask &= ~(1 << DIRTY_MEMORY_CODE);
> > +
> > cpu_physical_memory_set_dirty_range(memory_region_get_ram_addr(mr) + addr,
> > +4, dirty_log_mask);
> > +}
> > +r = MEMTX_OK;
> > +}
> > +*result = r;
> > +RCU_READ_UNLOCK();
> > +
> > +return old;
> > +}
> > +
> >  /* warning: addr must be aligned */
> >  static inline void glue(address_space_stl_internal, SUFFIX)(ARG1_DECL,
> >  hwaddr addr, uint32_t val, MemTxAttrs attrs,
> > diff --git a/target/i386/excp_helper.c b/target/i386/excp_helper.c
> > index 49231f6

Re: [Qemu-devel] [PATCH v5 08/36] ppc/xive: introduce a simplified XIVE presenter

2018-11-28 Thread Benjamin Herrenschmidt
On Thu, 2018-11-29 at 11:47 +1100, David Gibson wrote:
> 
> 1) read/write accessors which take a word number
> 
> 2) A "get" accessor which copies the whole structure, but "write"
> accessor which takes a word number.  The asymmetry is a bit ugly, but
> it's the non-atomic writeback of the whole structure which I'm most
> uncomfortable with.

It shouldn't be a big deal though, there are HW facilities to access
the structures "atomically" anyway, due to the caching done by the
XIVE.

> 3) A map/unmap interface which gives you / releases a pointer to the
> "live" structure.  For powernv that would become
> address_space_map()/unmap().  For PAPR it would just be reutn pointer
> / no-op.




[Qemu-devel] [RFC/PATCH] i386: Atomically update PTEs with mttcg

2018-11-28 Thread Benjamin Herrenschmidt
Afaik, this isn't well documented (at least it wasn't when I last looked)
but OSes such as Linux rely on this behaviour:

The HW updates to the page tables need to be done atomically with the
checking of the present bit (and other permissions).

This is what allows Linux to do simple xchg of PTEs with 0 and assume
the value read has "final" stable dirty and accessed bits (the TLB
invalidation is deferred).

Signed-off-by: Benjamin Herrenschmidt 
---

This is lightly tested at this point, mostly to gather comments on the
approach.

I noticed that RiscV is attempting to do something similar but with endian
bugs, at least from my reading of the code.

 include/exec/memory_ldst.inc.h |   3 +
 memory_ldst.inc.c  |  38 
 target/i386/excp_helper.c  | 104 +
 3 files changed, 121 insertions(+), 24 deletions(-)

diff --git a/include/exec/memory_ldst.inc.h b/include/exec/memory_ldst.inc.h
index 272c20f02e..b7a41a0ad4 100644
--- a/include/exec/memory_ldst.inc.h
+++ b/include/exec/memory_ldst.inc.h
@@ -28,6 +28,9 @@ extern uint64_t glue(address_space_ldq, SUFFIX)(ARG1_DECL,
 hwaddr addr, MemTxAttrs attrs, MemTxResult *result);
 extern void glue(address_space_stl_notdirty, SUFFIX)(ARG1_DECL,
 hwaddr addr, uint32_t val, MemTxAttrs attrs, MemTxResult *result);
+extern uint32_t glue(address_space_cmpxchgl_notdirty, SUFFIX)(ARG1_DECL,
+hwaddr addr, uint32_t old, uint32_t new, MemTxAttrs attrs,
+MemTxResult *result);
 extern void glue(address_space_stw, SUFFIX)(ARG1_DECL,
 hwaddr addr, uint32_t val, MemTxAttrs attrs, MemTxResult *result);
 extern void glue(address_space_stl, SUFFIX)(ARG1_DECL,
diff --git a/memory_ldst.inc.c b/memory_ldst.inc.c
index acf865b900..63af8f7dd2 100644
--- a/memory_ldst.inc.c
+++ b/memory_ldst.inc.c
@@ -320,6 +320,44 @@ void glue(address_space_stl_notdirty, SUFFIX)(ARG1_DECL,
 RCU_READ_UNLOCK();
 }
 
+/* This is meant to be used for atomic PTE updates under MT-TCG */
+uint32_t glue(address_space_cmpxchgl_notdirty, SUFFIX)(ARG1_DECL,
+hwaddr addr, uint32_t old, uint32_t new, MemTxAttrs attrs, MemTxResult 
*result)
+{
+uint8_t *ptr;
+MemoryRegion *mr;
+hwaddr l = 4;
+hwaddr addr1;
+MemTxResult r;
+uint8_t dirty_log_mask;
+
+/* Must test result */
+assert(result);
+
+RCU_READ_LOCK();
+mr = TRANSLATE(addr, , , true, attrs);
+if (l < 4 || !memory_access_is_direct(mr, true)) {
+r = MEMTX_ERROR;
+} else {
+uint32_t orig = old;
+
+ptr = qemu_map_ram_ptr(mr->ram_block, addr1);
+old = atomic_cmpxchg(ptr, orig, new);
+
+if (old == orig) {
+dirty_log_mask = memory_region_get_dirty_log_mask(mr);
+dirty_log_mask &= ~(1 << DIRTY_MEMORY_CODE);
+cpu_physical_memory_set_dirty_range(memory_region_get_ram_addr(mr) 
+ addr,
+4, dirty_log_mask);
+}
+r = MEMTX_OK;
+}
+*result = r;
+RCU_READ_UNLOCK();
+
+return old;
+}
+
 /* warning: addr must be aligned */
 static inline void glue(address_space_stl_internal, SUFFIX)(ARG1_DECL,
 hwaddr addr, uint32_t val, MemTxAttrs attrs,
diff --git a/target/i386/excp_helper.c b/target/i386/excp_helper.c
index 49231f6b69..5ccb9d6d6a 100644
--- a/target/i386/excp_helper.c
+++ b/target/i386/excp_helper.c
@@ -157,11 +157,45 @@ int x86_cpu_handle_mmu_fault(CPUState *cs, vaddr addr, 
int size,
 
 #else
 
+static inline uint64_t update_entry(CPUState *cs, target_ulong addr,
+uint64_t orig_entry, uint32_t bits)
+{
+uint64_t new_entry = orig_entry | bits;
+
+/* Write the updated bottom 32-bits */
+if (qemu_tcg_mttcg_enabled()) {
+uint32_t old_le = cpu_to_le32(orig_entry);
+uint32_t new_le = cpu_to_le32(new_entry);
+MemTxResult result;
+uint32_t old_ret;
+
+old_ret = address_space_cmpxchgl_notdirty(cs->as, addr,
+  old_le, new_le,
+  MEMTXATTRS_UNSPECIFIED,
+  );
+if (result == MEMTX_OK) {
+if (old_ret != old_le) {
+new_entry = 0;
+}
+return new_entry;
+}
+
+/* Do we need to support this case where PTEs aren't in RAM ?
+ *
+ * For now fallback to non-atomic case
+ */
+}
+
+x86_stl_phys_notdirty(cs, addr, new_entry);
+
+return new_entry;
+}
+
 static hwaddr get_hphys(CPUState *cs, hwaddr gphys, MMUAccessType access_type,
 int *prot)
 {
 CPUX86State *env = _CPU(cs)->env;
-uint64_t rsvd_mask = PG_HI_RSVD_MASK;
+uint64_t rsvd_mask;
 uint64_t ptep, pte;
 uint64_t exit_info_1 = 0;
 target_ulong pde_addr, pte_addr;
@@ -172,6 +206,8 @@ static hwaddr get_hphys(CPUState *cs, hwaddr g

Re: [Qemu-devel] [PATCH v5 08/36] ppc/xive: introduce a simplified XIVE presenter

2018-11-27 Thread Benjamin Herrenschmidt
On Wed, 2018-11-28 at 10:49 +1100, David Gibson wrote:
> On Fri, Nov 16, 2018 at 11:57:01AM +0100, Cédric Le Goater wrote:
> > The last sub-engine of the XIVE architecture is the Interrupt
> > Virtualization Presentation Engine (IVPE). On HW, they share elements,
> > the Power Bus interface (CQ), the routing table descriptors, and they
> > can be combined in the same HW logic. We do the same in QEMU and
> > combine both engines in the XiveRouter for simplicity.
> 
> Ok, I'm not entirely convinced combining the IVPE and IVRE into a
> single object is a good idea, but we can probably discuss that once
> I've read further.

Keep in mind that the communication between the two is a bit more hairy
than simple notifications, though. Especially once we start
implementing escalation interrupts or worse, groups...

> > When the IVRE has completed its job of matching an event source with a
> > Notification Virtual Target (NVT) to notify, it forwards the event
> > notification to the IVPE sub-engine. The IVPE scans the thread
> > interrupt contexts of the Notification Virtual Targets (NVT)
> > dispatched on the HW processor threads and if a match is found, it
> > signals the thread. If not, the IVPE escalates the notification to
> > some other targets and records the notification in a backlog queue.
> > 
> > The IVPE maintains the thread interrupt context state for each of its
> > NVTs not dispatched on HW processor threads in the Notification
> > Virtual Target table (NVTT).
> > 
> > The model currently only supports single NVT notifications.
> > 
> > Signed-off-by: Cédric Le Goater 
> > ---
> >  include/hw/ppc/xive.h  |  13 +++
> >  include/hw/ppc/xive_regs.h |  22 
> >  hw/intc/xive.c | 223 +
> >  3 files changed, 258 insertions(+)
> > 
> > diff --git a/include/hw/ppc/xive.h b/include/hw/ppc/xive.h
> > index 5987f26ddb98..e715a6c6923d 100644
> > --- a/include/hw/ppc/xive.h
> > +++ b/include/hw/ppc/xive.h
> > @@ -197,6 +197,10 @@ typedef struct XiveRouterClass {
> > XiveEND *end);
> >  int (*set_end)(XiveRouter *xrtr, uint8_t end_blk, uint32_t end_idx,
> > XiveEND *end);
> > +int (*get_nvt)(XiveRouter *xrtr, uint8_t nvt_blk, uint32_t nvt_idx,
> > +   XiveNVT *nvt);
> > +int (*set_nvt)(XiveRouter *xrtr, uint8_t nvt_blk, uint32_t nvt_idx,
> > +   XiveNVT *nvt);
> 
> As with the ENDs, I don't think get/set is a good interface for a
> bigger-than-word-size object.
> 
> >  } XiveRouterClass;
> >  
> >  void xive_eas_pic_print_info(XiveEAS *eas, uint32_t lisn, Monitor *mon);
> > @@ -207,6 +211,10 @@ int xive_router_get_end(XiveRouter *xrtr, uint8_t 
> > end_blk, uint32_t end_idx,
> >  XiveEND *end);
> >  int xive_router_set_end(XiveRouter *xrtr, uint8_t end_blk, uint32_t 
> > end_idx,
> >  XiveEND *end);
> > +int xive_router_get_nvt(XiveRouter *xrtr, uint8_t nvt_blk, uint32_t 
> > nvt_idx,
> > +XiveNVT *nvt);
> > +int xive_router_set_nvt(XiveRouter *xrtr, uint8_t nvt_blk, uint32_t 
> > nvt_idx,
> > +XiveNVT *nvt);
> >  
> >  /*
> >   * XIVE END ESBs
> > @@ -274,4 +282,9 @@ extern const MemoryRegionOps xive_tm_ops;
> >  
> >  void xive_tctx_pic_print_info(XiveTCTX *tctx, Monitor *mon);
> >  
> > +static inline uint32_t xive_tctx_cam_line(uint8_t nvt_blk, uint32_t 
> > nvt_idx)
> > +{
> > +return (nvt_blk << 19) | nvt_idx;
> 
> I'm guessing this formula is the standard way of combining the NVT
> block and index into a single word?  If so, I think we should
> standardize on passing a single word "nvt_id" around and only
> splitting it when we need to use the block separately.  Same goes for
> the end_id, assuming there's a standard way of putting that into a
> single word.  That will address the point I raised earlier about lisn
> being passed around as a single word, but these later stage ids being
> split.
> 
> We'll probably want some inlines or macros to build an
> nvt/end/lisn/whatever id from block and index as well.
> 
> > +}
> > +
> >  #endif /* PPC_XIVE_H */
> > diff --git a/include/hw/ppc/xive_regs.h b/include/hw/ppc/xive_regs.h
> > index 2e3d6cb507da..05cb992d2815 100644
> > --- a/include/hw/ppc/xive_regs.h
> > +++ b/include/hw/ppc/xive_regs.h
> > @@ -158,4 +158,26 @@ typedef struct XiveEND {
> >  #define END_W7_F1_LOG_SERVER_ID  PPC_BITMASK32(1, 31)
> >  } XiveEND;
> >  
> > +/* Notification Virtual Target (NVT) */
> > +typedef struct XiveNVT {
> > +uint32_tw0;
> > +#define NVT_W0_VALID PPC_BIT32(0)
> > +uint32_tw1;
> > +uint32_tw2;
> > +uint32_tw3;
> > +uint32_tw4;
> > +uint32_tw5;
> > +uint32_tw6;
> > +uint32_tw7;
> > +uint32_tw8;
> > +#define NVT_W8_GRP_VALID PPC_BIT32(0)
> > +uint32_tw9;
> > +

Re: [Qemu-devel] [PATCH v5 09/36] ppc/xive: notify the CPU when the interrupt priority is more privileged

2018-11-27 Thread Benjamin Herrenschmidt
On Wed, 2018-11-28 at 11:13 +1100, David Gibson wrote:
> Don't you need a cast to avoid (nsr << 8) being a shift-wider-than-size?

Shouldn't be a problem as long as it fits in an int, no ?

Cheers,
Ben.




Re: [Qemu-devel] [PATCH v5 04/36] ppc/xive: introduce the XiveRouter model

2018-11-21 Thread Benjamin Herrenschmidt
On Thu, 2018-11-22 at 15:44 +1100, David Gibson wrote:
> 
> Sorry, didn't think of this in my first reply.
> 
> 1) Does the hardware ever actually write back to the EAS?  I know it
> does for the END, but it's not clear why it would need to for the
> EAS.  If not, we don't need the setter.

Nope, though the PAPR model will via hcalls

> 
> 2) The signatures are a bit odd here.  For the setter, a value would
> make sense than a (XiveEAS *), since it's just a word.  For the getter
> you could return the EAS value directly rather than using a pointer -
> there's already a valid bit in the EAS so you can construct a value
> with that cleared if the lisn is out of bounds.




Re: [Qemu-devel] [PATCH v5 05/36] ppc/xive: introduce the XIVE Event Notification Descriptors

2018-11-21 Thread Benjamin Herrenschmidt
On Thu, 2018-11-22 at 15:41 +1100, David Gibson wrote:
> 
> > +void xive_end_reset(XiveEND *end)
> > +{
> > +memset(end, 0, sizeof(*end));
> > +
> > +/* switch off the escalation and notification ESBs */
> > +end->w1 = END_W1_ESe_Q | END_W1_ESn_Q;
> 
> It's not obvious to me what circumstances this would be called under.
> Since the ENDs are in system memory, a memset() seems like an odd
> thing for (virtual) hardware to be doing to it.
> 
> > +}

Not on PAPR ...

Ben.





Re: [Qemu-devel] [PATCH v2 1/2] ppc/pnv: Add model for Power8 PHB3 PCIe Host bridge

2018-07-27 Thread Benjamin Herrenschmidt
On Fri, 2018-07-27 at 15:32 +1000, David Gibson wrote:
> 
> > > What is this pci bridge representing?  I know PCI-e PHBs typically
> > > have a pseudo P2P bridge right under them, but isn't that represnted
> > > by the root complex above?
> > 
> > This is the legacy pci bridge under the pcie bus.
> 
> Ah, ok.  Didn't realise there was a vanilla PCI bridge built in.

There isn't. That should probably be created by the machine.

> > Here is is the qdev hierarchy :
> > 
> >   dev: pnv-phb3, id ""
> > index = 0 (0x0)
> > chip-id = 0 (0x0)
> > bus: phb3-root-bus
> >   type pnv-phb3-root-bus
> >   dev: pnv-phb3-rc, id ""
> > power_controller_present = true
> > chassis = 0 (0x0)
> > slot = 1 (0x1)
> > port = 0 (0x0)
> > aer_log_max = 8 (0x8)
> > addr = 00.0
> > romfile = ""
> > rombar = 1 (0x1)
> > multifunction = false
> > command_serr_enable = true
> > x-pcie-lnksta-dllla = true
> > x-pcie-extcap-init = true
> > class PCI bridge, addr 00:00.0, pci id 1014:03dc (sub :)
> > bus: pcie.0
> >   type PCIE
> >   dev: pci-bridge, id ""
> > chassis_nr = 128 (0x80)
> > msi = "off"
> > shpc = false
> > addr = 00.0
> > romfile = ""
> > rombar = 1 (0x1)
> > multifunction = false
> > command_serr_enable = true
> > x-pcie-lnksta-dllla = true
> > x-pcie-extcap-init = true
> > class PCI bridge, addr 00:00.0, pci id 1b36:0001 (sub :)
> > bus: pci.0
> >   type PCI
> > 
> > [ ... ]
> > 
> > >> +static const PropertyInfo pnv_phb3_phb_id_propinfo = {
> > >> +.name = "irq",
> > >> +.get = pnv_phb3_get_phb_id,
> > >> +.set = pnv_phb3_set_phb_id,
> > >> +};
> > > 
> > > Can't you use a static DeviceProps style property for this, which is a
> > > bit simpler?
> > 
> > OK. We will address user creatable PHBs in some other way. Most
> > certainly in the realize routine like you suggested.
> > 
> > Thanks,
> > 
> > C. 
> > 
> 




Re: [Qemu-devel] [PATCH v2 1/2] ppc/pnv: Add model for Power8 PHB3 PCIe Host bridge

2018-07-27 Thread Benjamin Herrenschmidt
On Fri, 2018-07-27 at 10:25 +0200, Cédric Le Goater wrote:
> Each PHB creates a pci-bridge device and the PCI bus that comes with it. 
> It makes things easier to define PCI devices. 
> 
> It is still quite complex ... Here is a sample :
> 
> qemu-system-ppc64 -m 2G -machine powernv \
>   -cpu POWER8 -smp 2,cores=2,threads=1 -accel tcg,thread=multi \
>   -kernel ./zImage.epapr -initrd ./rootfs.cpio.xz -bios ./skiboot.lid \
>   \
>   -device megasas,id=scsi0,bus=pci.0,addr=0x1 \
>   -drive 
> file=./rhel7-ppc64le.qcow2,if=none,id=drive-scsi0-0-0-0,format=qcow2,cache=none
>  \
>   -device 
> scsi-hd,bus=scsi0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0-0-0-0,id=scsi0-0-0-0,bootindex=2
>  \
>   \
>   -device ich9-ahci,id=sata0,bus=pci.1,addr=0x1 \
>   -drive 
> file=./ubuntu-ppc64le.qcow2,if=none,id=drive0,format=qcow2,cache=none \
>   -device ide-hd,bus=sata0.0,unit=0,drive=drive0,id=ide,bootindex=1 \
>   -device e1000,netdev=net0,mac=C0:FF:EE:00:00:02,bus=pci.1,addr=0x2 \
>   -netdev bridge,helper=/usr/libexec/qemu-bridge-helper,br=virbr0,id=net0 \
>   -device nec-usb-xhci,bus=pci.1,addr=0x7 \

I don't understand why. That means you can't put emulated (or real)
PCIe device below it ? Why force them down the path of having a bridge
to legacy PCI always ?

My original intent was to have one such bridge in a machine for the
various default PCI devices qemu has that aren't (yet) PCIe (and also
for testing :-) but I never thought we'd throw one onto every PCIe bus.

Ben.





Re: [Qemu-devel] [PATCH v2 1/2] ppc/pnv: Add model for Power8 PHB3 PCIe Host bridge

2018-07-27 Thread Benjamin Herrenschmidt
On Fri, 2018-07-27 at 09:16 +0200, Cédric Le Goater wrote:
> > I'd have to remember how PQ works on P8 ... my gut feeling is that we
> > should resend if P=1 but I'm no 100% certain.
> 
> This is not exactly what the code does. To force a resend, it ignores 
> P but if Q=1, it bails out without doing anything, like a normal trigger 
> would do. So I think that in the resend case we should ignore Q as well.

Not too sure. We might need to swallow the resend if PQ became 01.

I don't know for sure what the HW does. Does it unconditionally resend
or does it re-evaluate the priority etc... it's unclear to me.

But yes, we should still resend if PQ=11

Cheers,
Ben.





Re: [Qemu-devel] [PATCH v2 1/2] ppc/pnv: Add model for Power8 PHB3 PCIe Host bridge

2018-07-26 Thread Benjamin Herrenschmidt
On Thu, 2018-07-26 at 11:03 +0200, Cédric Le Goater wrote:
> Ben,
> 
> I have found out recently that the QEMU PowerNV could hang while accessing
> the disk. 
> 
> The issue seems to be the phb3_msi_try_send() routine when called from 
> the resend() handler. The 'P' is ignored in that case but not the 'Q' 
> bit which means that no interrupt will be resent if P|Q are set. 

I'd have to remember how PQ works on P8 ... my gut feeling is that we
should resend if P=1 but I'm no 100% certain.

> See the log extract below  :
> 
>   PHB3(phb3_msi_try_send): MSI 0: try_send, ive=0x200500fd eff pq=0 
> prio=5 server=8 ignore_p=0
>   PHB3(phb3_msi_set_p): MSI 0: setting P
>   PHB3(phb3_msi_set_p):  IVE readback: 0x200501fd
>   PHB3(phb3_msi_reject): MSI 0 rejected
>   PHB3(phb3_msi_resend): MSI resend...
>   PHB3(phb3_msi_try_send): MSI 0: try_send, ive=0x200501fd eff pq=0 
> prio=5 server=8 ignore_p=1
>   PHB3(phb3_msi_set_p): MSI 0: setting P
>   PHB3(phb3_msi_set_p):  IVE readback: 0x200501fd
>   PHB3(phb3_msi_reject): MSI 0 rejected
>   PHB3(phb3_msi_try_send): MSI 0: try_send, ive=0x200501fd eff pq=2 
> prio=5 server=8 ignore_p=0
>   PHB3(phb3_msi_set_q): MSI 0: setting Q
>   PHB3(phb3_msi_set_q):  IVE readback: 0x2005010100fd
>   PHB3(phb3_msi_resend): MSI resend...
>   PHB3(phb3_msi_try_send): MSI 0: try_send, ive=0x2005010100fd eff pq=1 
> prio=5 server=8 ignore_p=1
>   PHB3(phb3_msi_try_send): MSI 0: try_send, ive=0x2005010100fd eff pq=3 
> prio=5 server=8 ignore_p=0
>   PHB3(phb3_msi_try_send): MSI 0: try_send, ive=0x2005010100fd eff pq=3 
> prio=5 server=8 ignore_p=0
>   PHB3(phb3_msi_try_send): MSI 0: try_send, ive=0x2005010100fd eff pq=3 
> prio=5 server=8 ignore_p=0
>   PHB3(phb3_msi_try_send): MSI 0: try_send, ive=0x2005010100fd eff pq=3 
> prio=5 server=8 ignore_p=0
>   ... goes on and on ...
>   hangs
> 
> I have added the relevant code at the bottom of the email. 
> 
> 
> If the 'Q' bit is ignored also, the results are good with a SATA drive 
> or a SCSI drive using the megasas model. Do you think this is correct ?
> I would say so but I am still discovering that part.
> 
> I have no idea why it didn't show up before. May be because we mostly 
> used virtio-blk.
> 
> Thanks,
> 
> C. 
> 
> > +static void phb3_msi_try_send(Phb3MsiState *msi, int srcno, bool ignore_p)
> > +{
> > +ICSState *ics = ICS_BASE(msi);
> > +uint64_t ive;
> > +uint64_t server, prio, pq, gen;
> > +
> > +if (!phb3_msi_read_ive(msi->phb, srcno, )) {
> > +return;
> > +}
> > +
> > +server = GETFIELD(IODA2_IVT_SERVER, ive);
> > +prio = GETFIELD(IODA2_IVT_PRIORITY, ive);
> > +pq = GETFIELD(IODA2_IVT_Q, ive);
> > +if (!ignore_p) {
> > +pq |= GETFIELD(IODA2_IVT_P, ive) << 1;
> > +}
> > +gen = GETFIELD(IODA2_IVT_GEN, ive);
> > +
> > +/*
> > + * The low order 2 bits are the link pointer (Type II interrupts).
> > + * Shift back to get a valid IRQ server.
> > + */
> > +server >>= 2;
> > +
> > +switch (pq) {
> > +case 0: /* 00 */
> > +if (prio == 0xff) {
> > +/* Masked, set Q */
> > +phb3_msi_set_q(msi, srcno);
> > +} else {
> > +/* Enabled, set P and send */
> > +phb3_msi_set_p(msi, srcno, gen);
> > +icp_irq(ics, server, srcno + ics->offset, prio);
> > +}
> > +break;
> > +case 2: /* 10 */
> > +/* Already pending, set Q */
> > +phb3_msi_set_q(msi, srcno);
> > +break;
> > +case 1: /* 01 */
> > +case 3: /* 11 */
> > +default:
> > +/* Just drop stuff if Q already set */
> > +break;
> > +}
> > +}
> > +
> > +static void phb3_msi_set_irq(void *opaque, int srcno, int val)
> > +{
> > +Phb3MsiState *msi = PHB3_MSI(opaque);
> > +
> > +if (val) {
> > +phb3_msi_try_send(msi, srcno, false);
> > +}
> > +}
> 
> [ ... ]
>  
> > +static void phb3_msi_resend(ICSState *ics)
> > +{
> > +Phb3MsiState *msi = PHB3_MSI(ics);
> > +unsigned int i, j;
> > +
> > +if (msi->rba_sum == 0) {
> > +return;
> > +}
> > +
> > +for (i = 0; i < 32; i++) {
> > +if ((msi->rba_sum & (1u << i)) == 0) {
> > +continue;
> > +}
> > +msi->rba_sum &= ~(1u << i);
> > +for (j = 0; j < 64; j++) {
> > +if ((msi->rba[i] & (1ull << j)) == 0) {
> > +continue;
> > +}
> > +msi->rba[i] &= ~(1u << j);
> > +phb3_msi_try_send(msi, i * 64 + j, true);
> > +}
> > +}
> > +}
> 
> 




Re: [Qemu-devel] [PATCH v2 1/2] ppc/pnv: Add model for Power8 PHB3 PCIe Host bridge

2018-07-23 Thread Benjamin Herrenschmidt
On Tue, 2018-07-24 at 12:14 +1000, David Gibson wrote:
> > I don't know, is there much shared logic ? And the shared bits are the
> > subclassing, that's handled that way...
> > 
> > This is really a different piece of HW, a separate ICS implementation,
> > that has its own quirks, is configured via different registers, does
> > EOI differently etc...
> > 
> > Even the resend stuff is done differently, the resend bitmap is
> > actually SW visible via an IODA table.
> > 
> > I mean, we could (maybe we do these days not sure) have an ICS
> > superclass wrapper on the actual icp_irq() but that's really all there
> > is to it I think.
> 
> Hm, yeah, fair enough.  I realized what I was suggesting would
> actually need another layer of qirqs than we have, so it's not a good
> idea after all.  Ok, I'm happy proceeding with exposing icp_irq().

The only idea I have if you want to keep icp_* private is to put a
'helper' in the ICS superclass that the subclass uses to encapsulate
the ICP call, but at this point it's just churn.

But yeah you really don't want a layer of qirqs, it wouldn't work any
way, it's really an ICS, it will send messages to the ICP with a XIRR
value in them etc... just like an ICS, it's not somethign you want
qirq's for .

Cheers,
Ben.





Re: [Qemu-devel] [PATCH v2 1/2] ppc/pnv: Add model for Power8 PHB3 PCIe Host bridge

2018-07-23 Thread Benjamin Herrenschmidt
On Mon, 2018-07-23 at 14:16 +1000, David Gibson wrote:
> > 
> > Now, this is an ICS subclass, so why shouldn't it directly poke at the
> > target ICP ?
> 
> That's ok in theory, but causing it to expose the icp interface to a
> new module isn't great.
> 
> > It's an alternate to the normal ICS since it behaves a bit
> > differently (PQ bits & all).
> 
> AFAICT the PQ bits are effectively another filtering layer on top of
> the same ICS logic as elsewhere.  I think it's better if we can share
> that shared logic, rather than replicating it.

I don't know, is there much shared logic ? And the shared bits are the
subclassing, that's handled that way...

This is really a different piece of HW, a separate ICS implementation,
that has its own quirks, is configured via different registers, does
EOI differently etc...

Even the resend stuff is done differently, the resend bitmap is
actually SW visible via an IODA table.

I mean, we could (maybe we do these days not sure) have an ICS
superclass wrapper on the actual icp_irq() but that's really all there
is to it I think.

Cheers,
Ben.





Re: [Qemu-devel] [PATCH v2 1/2] ppc/pnv: Add model for Power8 PHB3 PCIe Host bridge

2018-07-18 Thread Benjamin Herrenschmidt
On Wed, 2018-07-18 at 16:12 +1000, David Gibson wrote:
> On Thu, Jun 28, 2018 at 10:36:32AM +0200, Cédric Le Goater wrote:
> > From: Benjamin Herrenschmidt 

Can you trim your replies ? It's really hard to find your comments in
such a long patch...

> > diff --git a/include/hw/ppc/xics.h b/include/hw/ppc/xics.h
> > index 6ac8a9392da6..966a996c2eac 100644
> > --- a/include/hw/ppc/xics.h
> > +++ b/include/hw/ppc/xics.h
> > @@ -194,6 +194,7 @@ void icp_set_mfrr(ICPState *icp, uint8_t mfrr);
> >  uint32_t icp_accept(ICPState *ss);
> >  uint32_t icp_ipoll(ICPState *ss, uint32_t *mfrr);
> >  void icp_eoi(ICPState *icp, uint32_t xirr);
> > +void icp_irq(ICSState *ics, int server, int nr, uint8_t priority);
> 
> Hrm... are you sure you need to expose this?

See further down...

> > +/*
> > + * The CONFIG_DATA register expects little endian accesses, but as the
> > + * region is big endian, we have to swap the value.
> > + */
> > +static void pnv_phb3_config_write(PnvPHB3 *phb, unsigned off,
> > +  unsigned size, uint64_t val)
> > +{
> > +uint32_t cfg_addr, limit;
> > +PCIDevice *pdev;
> > +
> > +pdev = pnv_phb3_find_cfg_dev(phb);
> > +if (!pdev) {
> > +return;
> > +}
> > +cfg_addr = (phb->regs[PHB_CONFIG_ADDRESS >> 3] >> 32) & 0xfff;
> > +cfg_addr |= off;
> 
> This looks weird - there are callers of this that appear to have low
> bits in 'off', then you're ORing it with overlapping low bits.

Should be ffc like the read case.

> 
> > +limit = pci_config_size(pdev);
> > +if (limit <= cfg_addr) {
> > +/* conventional pci device can be behind pcie-to-pci bridge.
> > +   256 <= addr < 4K has no effects. */
> > +return;
> > +}
> > +switch (size) {
> > +case 1:
> > +break;
> > +case 2:
> > +val = bswap16(val);
> > +break;
> > +case 4:
> > +val = bswap32(val);
> > +break;
> > +default:
> > +g_assert_not_reached();
> > +}
> > +pci_host_config_write_common(pdev, cfg_addr, limit, val, size);
> > +}
> > +
> > +static uint64_t pnv_phb3_config_read(PnvPHB3 *phb, unsigned off,
> > + unsigned size)
> > +{
> > +uint32_t cfg_addr, limit;
> > +PCIDevice *pdev;
> > +uint64_t val;
> > +
> > +pdev = pnv_phb3_find_cfg_dev(phb);
> > +if (!pdev) {
> > +return ~0ull;
> > +}
> > +cfg_addr = (phb->regs[PHB_CONFIG_ADDRESS >> 3] >> 32) & 0xffc;
> > +cfg_addr |= off;
> 
> This looks better, should it be 0xffc in the write path as well?
> 
> > +limit = pci_config_size(pdev);
> > +if (limit <= cfg_addr) {
> > +/* conventional pci device can be behind pcie-to-pci bridge.
> > +   256 <= addr < 4K has no effects. */
> > +return ~0ull;
> > +}
> > +val = pci_host_config_read_common(pdev, cfg_addr, limit, size);
> > +switch (size) {
> > +case 1:
> > +return val;
> > +case 2:
> > +return bswap16(val);
> > +case 4:
> > +return bswap32(val);
> > +default:
> > +g_assert_not_reached();
> > +}
> > +}
> > +
> > +static void pnv_phb3_check_m32(PnvPHB3 *phb)
> > +{
> > +uint64_t base, start, size;
> > +MemoryRegion *parent;
> > +PnvPBCQState *pbcq = >pbcq;
> > +
> > +if (phb->m32_mapped) {
> > +memory_region_del_subregion(phb->mr_m32.container, >mr_m32);
> > +phb->m32_mapped = false;
> 
> Could you use memory_region_set_enabled() rather than having to add
> and delete the subregion and keep track of it in this separate
> variable?

There was a reason for that but it's years old and I forgot... I think
when re-enabled it moves around, among other things. So it's not more
efficient.

> > +}
> > +
> > +/* Disabled ? move on with life ... */
> 
> Trivial: "nothing to do" seems to be the standard way to express this.
> Even trivialler: usual English rules don't put a space before '?' or
> '!' punctuation.

No, that's the tasteless English rule :-) It shall be overridden by
anybody interested in making things actually readable :-)

> > +
> > +static void pnv_phb3_lxivt_write(PnvPHB3 *phb, unsigned idx, uint64_t val)
> > +{
> > +ICSState *ics = >lsis;
> > +uin

Re: [Qemu-devel] [PATCH] ppc/pnv: Add model for Power8 PHB3 PCIe Host bridge

2018-06-28 Thread Benjamin Herrenschmidt
On Thu, 2018-06-28 at 10:00 +0200, Andrea Bolognani wrote:
> On Thu, 2018-06-28 at 13:59 +1000, David Gibson wrote:
> > On Wed, Jun 27, 2018 at 12:22:31PM +0200, Andrea Bolognani wrote:
> > > On Tue, 2018-06-26 at 19:02 +0200, Cédric Le Goater wrote:
> > > > I didn't follow that discussion but this is "another" kind of PHB.
> > > > This one models the baremetal controller as found on OpenPOWER and
> > > > IBM Power machines. pSeries has a virtual PHB.
> > > 
> > > I understand that, and of course libvirt will need to learn about
> > > this new type of PHB and make sure both pSeries and PowerNV guests
> > > get the correct one assigned to them.
> > 
> > Hmm.. does it?  I would have thought pnv could act more like x86, in
> > that libvirt doesn't attempt to create PHBs at all and just use the
> > ones that are built in.
> 
> AFAIK x86 guests have a single PHB and additional ones cannot be
> created in any way, which means we don't have to do any additional
> second-guessing when assigning IDs to additional PCI controllers.

That's a surprising limitation. A single PHB only supports a limited
number of MSIs no ? And only 256 bus numbers...

> > Though, come to that, I wouldn't think pnv support for libvirt would
> > be much of a priority anyway.  The machine type is still very much in
> > flux, and it's designed primarily for testing and development, not
> > "real world" usage.
> 
> Can you *guarantee* that someone won't ask for PowerNV support in
> libvirt at some point? Because if you can't (and I don't think you
> can ;) then this is still a valuable conversation to have.

It's rather unlikely for now as there is no KVM suport for it (it's
tricky, our chips aren't designed for full virtualization). That might
change in the future but not soon.

> > > What I meant is that pSeries guests get a single PHB by default,
> > > with additional ones being instantiable through -device; this is
> > > also consistent with how PCI controllers are added to other guest
> > > types including pc, q35 and aarch64/virt, so it would be really
> > > nice if PowerNV behaved the same way.
> > 
> > Well.. sure.. but it doesn't.  pSeries is a virtual platform, so we
> > have a reasonable amount of flexibility to define it as we want.
> > PowerNV is an emulation of existing hardware which has a specific
> > behaviour which we need to match.
> 
> Sure, that's something to keep in mind.
> 
> But the thing is, you still need to have *some* flexibility in
> the number of PHBs, since there is variation among real Power8
> and Power9 chips; in the current incarnation, that flexibility
> is provided by the num_phbs parameter, which is an entirely new
> interface that's exclusive to PowerNV.
> 
> What I'm suggesting is that the same amount of flexibility is
> offered through a standard interface, namely -device, instead.

But that's harder internally to qemu to properly "bind" to the chip
where the PHB resides etc... 

Cheers,
Ben.




Re: [Qemu-devel] [PATCH] ppc/pnv: Add model for Power8 PHB3 PCIe Host bridge

2018-06-27 Thread Benjamin Herrenschmidt
On Wed, 2018-06-27 at 09:46 +0200, Cédric Le Goater wrote:
> So the "IBM PHB3 PCIE Root Port" is already user createable.
> 
> I can take a look at user createable PHB3s. I think this is OK from a model
> perspective. The object is rather standalone, it needs the machine for 
> the XICS fabric and a couple of ids, phb id and chip id. These can come
> from the command line.
> 
> We want at least one PHB3 per socket/chip though. 

We don't want the user to specify the SCOM addresses though (for the
MMIO windows we should get skiboot to assign them).

If the user gets to specify a thing it would be which of the 3 or 4 HW
PHBs of the chip it is, the SCOM addresses gets deduced.

Cheers,
Ben.



Re: [Qemu-devel] [PATCH] ppc/pnv: Add model for Power8 PHB3 PCIe Host bridge

2018-06-26 Thread Benjamin Herrenschmidt
On Wed, 2018-06-27 at 03:35 +0300, Michael S. Tsirkin wrote:
> 
> > +
> > +/* Extract field fname from val */
> > +#define GETFIELD(fname, val)\
> > +(((val) & fname##_MASK) >> fname##_LSH)
> > +
> > +/* Set field fname of oval to fval
> > + * NOTE: oval isn't modified, the combined result is returned
> > + */
> > +#define SETFIELD(fname, oval, fval) \
> > +(((oval) & ~fname##_MASK) | \
> > + typeof(oval))(fval)) << fname##_LSH) & fname##_MASK))
> > +
> 
> Pls don't make up macros like these. We can't have each device do it.

So what ? We move the macros in a generic place ? These are MUCH better
than open-coding the masks & shifts and much less error prone.

> > @@ -1031,6 +1110,7 @@ static Property pnv_chip_properties[] = {
> >  DEFINE_PROP_UINT64("ram-size", PnvChip, ram_size, 0),
> >  DEFINE_PROP_UINT32("nr-cores", PnvChip, nr_cores, 1),
> >  DEFINE_PROP_UINT64("cores-mask", PnvChip, cores_mask, 0x0),
> > +DEFINE_PROP_UINT32("num-phbs", PnvChip, num_phbs, 1),
> >  DEFINE_PROP_END_OF_LIST(),
> >  };
> 
> How about instanciating each extra phb using -device?
> Seems better than teaching everyone about platform-specific
> options.

It's about which PHBs are enabled, not which are instanciated, if I
understand Cedric changes ...

This aims are implementing the POWER8 chip correctly, it has a fixed
number of PHBs per-chip at very specific XSCOM addresses, that the
firwmare knows about.

Cheers,
Ben.




Re: [Qemu-devel] [PATCH] ppc/pnv: Add model for Power8 PHB3 PCIe Host bridge

2018-06-26 Thread Benjamin Herrenschmidt
On Tue, 2018-06-26 at 17:57 +0200, Andrea Bolognani wrote:
> On Tue, 2018-06-26 at 15:59 +0200, Cédric Le Goater wrote:
> > This is a model of the PCIe host bridge found on Power8 chips,
> > including PowerBus logic interface, IOMMU support, PCIe root complex,
> > XICS MSI and LSI interrupt sources.
> > 
> > 4 PHBs are provisioned under the Power8 chip model to fit hardware but
> > only one is currently initialized.
> 
> What's the advantage in creating 4 PHBs instead of a single one,
> like we already do for pSeries guests? As it is, this will confuse
> the heck out of libvirt's PCI address allocation algorithm :)

This matches the actual HW. POWER9 will have 6 per chip :-)

The goal of the "powernv" platform in qemu is to closely match the
actual HW.

Note that pseries guests can (and will under some cirscumstances) have
multiple PHBs as well.

Cheers,
Ben.




Re: [Qemu-devel] [Qemu-ppc] [PATCH v2 02/19] spapr: introduce a skeleton for the XIVE interrupt controller

2018-02-12 Thread Benjamin Herrenschmidt
On Mon, 2018-02-12 at 13:20 +0100, Andrea Bolognani wrote:
> On Mon, 2018-02-12 at 13:02 +1100, Alexey Kardashevskiy wrote:
> > On 12/02/18 09:55, Benjamin Herrenschmidt wrote:
> > > Well, we have a problem then. It looks like Qemu broken migration is
> > > fundamentally incompatible with PAPR and CAS design...
> > > 
> > > I know we don't migrate the configuration, that's not exactly what I
> > > had in mind tho... Can we have some piece of *data* from the machine be
> > > migrated first, and use it on the target to reconfigure the interrupt
> > > controller before the stream arrives ?
> > 
> > These days this is done via libvirt - it reads properties it needs via QMP,
> > then sends an XML with everything (the interrupt controller type may be one
> > of such properties), and starts the destination QEMU with the explicit
> > interrupt controller (like -machine pseries,intrc=xive).
> 
> Clarification: libvirt will use the user-defined XML configuration
> to generate the QEMU command line both for the source and the target
> of the migration, but it will not automagically figure out properties
> through QMP. So if you want the controller to explicitly show up on
> the QEMU command line, libvirt should be taught about it.

Which can't work because the guest pretty much decides what it will be
early on during the boot process.

So we're back to square 1 having to instanciate both objects in qemu
with some kind of "activation" flag.

Cheers,
Ben.




Re: [Qemu-devel] [PATCH v2 02/19] spapr: introduce a skeleton for the XIVE interrupt controller

2018-02-11 Thread Benjamin Herrenschmidt
On Sun, 2018-02-11 at 19:08 +1100, David Gibson wrote:
> On Thu, Jan 18, 2018 at 08:27:52AM +1100, Benjamin Herrenschmidt wrote:
> > On Wed, 2018-01-17 at 15:39 +0100, Cédric Le Goater wrote:
> > > Migration is a problem. We will need both backend QEMU objects to be 
> > > available anyhow if we want to migrate. So we are back to the current 
> > > solution creating both QEMU objects but we can try to defer some of the 
> > > KVM inits and create the KVM device on demand at CAS time.
> > 
> > Do we have a way to migrate a piece of info from the machine *first*
> > that indicate what type of XICS/XIVE to instanciate ?
> 
> Nope.  qemu migration doesn't work like that.  Yes, it should, and
> everyone knows it, but changing it is a really long term project.

Well, we have a problem then. It looks like Qemu broken migration is
fundamentally incompatible with PAPR and CAS design...

I know we don't migrate the configuration, that's not exactly what I
had in mind tho... Can we have some piece of *data* from the machine be
migrated first, and use it on the target to reconfigure the interrupt
controller before the stream arrives ?

Otherwise, we have indeed no much choice but the horrible wart of
creating both interrupt controllers with only one "active".

> > 
> > > The next problem is the ICP object that currently needs the KVM device 
> > > fd to connect the vcpus ... So, we will need to change that also. 
> > > That is probably the biggest problem today. We need a way to disconnect 
> > > the vpcu from the KVM device and see how we can defer the connection.
> > > I need to make sure this is possible, I can check that without XIVE
> > 
> > Ben.
> > 
> 
> 



Re: [Qemu-devel] [PATCH v2 02/19] spapr: introduce a skeleton for the XIVE interrupt controller

2018-01-18 Thread Benjamin Herrenschmidt
On Thu, 2018-01-18 at 14:27 +0100, Cédric Le Goater wrote:
> The source and the target machines should have the same realized 
> objects. I think this is the simplest solution to keep the migration 
> framework maintainable. 

Yeah well, it all boils down to qemu migration being completely brain
dead in relying on an external entity to create the same machine rather
than carrying the configuration in the migration stream... ugh.

> I don't think it is a problem to call a xics_fini() routine to 
> destroy the XICS KVM device if a new interrupt mode was negotiated
> in CAS. We would then call a xive_init() routing to create the new 
> XIVE KVM device.
> 
> When done, the question boils down to disconnect and reconnect the 
> vcpus to the KVM device. The QEMU CPU ->intc pointer should be 
> updated also but that's a QEMU level problem. Already done.

The problem is more the in-kernel hooks.
 
> In the QEMU "icp-kvm" object, the connection to the KVM device 
> is currently forced in the realize routine but we can add some 
> handlers to manage the link. Similar handlers would do the same 
> in the QEMU "nvt-kvm" object when XIVE is on.
> 
> 
> If we think this is a possible way to address the problem, I can 
> check the above thinking on a XICS KVM machine and force the 
> init/fini sequence in the CAS negotiation process. I will need 
> a KVM ioctl to destroy a device and maybe a KVM VCPU ioctl to 
> disable a capability. 



Re: [Qemu-devel] [PATCH v2 02/19] spapr: introduce a skeleton for the XIVE interrupt controller

2018-01-17 Thread Benjamin Herrenschmidt
On Wed, 2018-01-17 at 15:39 +0100, Cédric Le Goater wrote:
> Migration is a problem. We will need both backend QEMU objects to be 
> available anyhow if we want to migrate. So we are back to the current 
> solution creating both QEMU objects but we can try to defer some of the 
> KVM inits and create the KVM device on demand at CAS time.

Do we have a way to migrate a piece of info from the machine *first*
that indicate what type of XICS/XIVE to instanciate ?

> The next problem is the ICP object that currently needs the KVM device 
> fd to connect the vcpus ... So, we will need to change that also. 
> That is probably the biggest problem today. We need a way to disconnect 
> the vpcu from the KVM device and see how we can defer the connection.
> I need to make sure this is possible, I can check that without XIVE

Ben.




Re: [Qemu-devel] [PATCH v2 02/19] spapr: introduce a skeleton for the XIVE interrupt controller

2018-01-17 Thread Benjamin Herrenschmidt
On Wed, 2018-01-17 at 10:18 +0100, Cédric Le Goater wrote:
> > > > Also, have we decided how the process of switching between XICS and
> > > > XIVE will work vs. CAS ? 
> > > 
> > > That's how it is described in the architecture. The current choice is
> > > to create both XICS and XIVE objects and choose at CAS which one to
> > > use. It relies today on the capability of the pseries machine to 
> > > allocate IRQ numbers for both interrupt controller backends. These
> > > patches have been merged in QEMU.
> > > 
> > > A change of interrupt mode results in a reset. The device tree is 
> > > populated accordingly and the ICPs are switched for the model in 
> > > use. 
> > 
> > For KVM we need to only instanciate one of them though.
> 
> Hmm,
> 
> How would we handle a guest rebooting on a kernel without XIVE support ? 

It will do CAS again and we can change the devices.

> Are you suggesting to create the XICS or XIVE device in the CAS negotiation 
> process ? So, the machine would not have any interrupt controller before 
> CAS. That seems really late to me. grub uses the console for instance. 

We start with XICS by default.

> I think it should prepare for both options, start in XIVE legacy mode, 
> which is XICS, then possibly switch to XIVE exploitation mode.
> 
> > > > And how that will interact with KVM ? 
> > > 
> > > I expect we will do the same, which is to create two KVM devices to 
> > > be able to handle both interrupt controller backends depending on the 
> > > mode negotiated by the guest.  
> > 
> > That will be an ungodly mess, I'd rather we only instanciate the right
> > one.
> 
> It's rather transparent currently in the emulated version. There are two 
> sets of objects in QEMU, switching is done in CAS. KVM support should not 
> change anything in that area. 
> 
> I expect the 'xive-kvm' object to get/set states for migration, just like 
> for XICS and to setup the ESB+TIMA memory regions, which is new. 

But both XICS and XIVE are completely different kernel KVM devices that will
need to "hook" into the same set of internal hooks for things like interrupts
being passed through, RTAS calls etc... 

How does KVM knows which one to "activate" ?

I don't think the kernel should have both. 

> > > > I was
> > > > thinking the kernel would implement a different KVM device type, ie
> > > > the "emulated XICS" would remain KVM_DEV_TYPE_XICS and XIVE would be
> > > > KVM_DEV_TYPE_XIVE.
> > > 
> > > yes. it makes sense. The new device will have a lot in common with the 
> > > KVM_DEV_TYPE_XICS using kvm_xive_ops.
> > 
> > Ben.
> > 



Re: [Qemu-devel] [PATCH v2 02/19] spapr: introduce a skeleton for the XIVE interrupt controller

2017-12-21 Thread Benjamin Herrenschmidt
On Thu, 2017-12-21 at 10:16 +0100, Cédric Le Goater wrote:
> On 12/21/2017 01:12 AM, Benjamin Herrenschmidt wrote:
> > On Wed, 2017-12-20 at 16:09 +1100, David Gibson wrote:
> > > 
> > > As you've suggested in yourself, I think we might need to more
> > > explicitly model the different components of the XIVE system.  As part
> > > of that, I think you need to be clearer in this base skeleton about
> > > exactly what component your XIVE object represents.
> > > 
> > > If the answer is "the overall thing" I suspect that's not what you
> > > want - I had one of those for XICs which proved to be a mistake
> > > (eventually replaced by the XICSFabric interface).
> > > 
> > > Changing the model later isn't impossible, but doing so without
> > > breaking migration can be a real pain, so I think it's worth a
> > > reasonable effort to try and get it right initially.
> > 
> > Note: we do need to speed things up a bit, as having exploitation mode
> > in KVM will significantly help with IPI performance among other things.
> > 
> > I'm about ready to do the KVM bits. The one thing we need to discuss
> > and figure a good design for is how we map all those interrupt control
> > pages into qemu.
> > 
> > Each interrupt (either PCIe pass-through or the "generic XIVE IPIs"
> > which are used for guest IPIs and for vio/virtio/emulated interrupts)
> > comes with a "control page" (ESB page) which needs to be mapped into
> > the guest, and the generic IPIs also come with a trigger page which
> > needs to be mapped into the guest for guest IPIs or OpenCAPI
> > interrupts, or just qemu for emulated devices.
> 
> what about the OS TIMA page ? Do we trap the accesses in QEMU and
> forward them to KVM ? or do we use a similar mechanism. 

No, no, we'll have an mmap facility for it in kvm but it worries me
less as there's only one of these and there's little damage qemu can do
having access to it :)
> 
> > Now that can be thousands of these critters. I certainly don't want to
> > create thousands of VMAs in qemu and even less thousands of memory
> > regions in KVM.
> 
> we can provision one mapping per kvmppc_xive_src_block  maybe ?  

Maybe. Last I looked KVM walk of memory regions was linear though. Mind
you it's not a huge deal if the guest RAM is always in the first
entries.

> > So we need some kind of mechanism by wich a single large VMA gets
> > mmap'ed into qemu (or maybe a couple of these, but not too many) and
> > the interrupt pages can be assigned to slots in there and demand
> > faulted.
> 
> Frederic has started to put in place a similar mecanism for OpenCAPI.

I know, though he made it rather OpenCAPI specific which is going to be
"interesting" when it comes to virtualizing OpenCAPI...

> > For the generic interrupts, this can probably be covered by KVM, adding
> > some arch ioctls for allocating IPIs and mmap'ing that region etc...
> 
> The KVM device has a ioctl handler :
>
>   struct kvm_device_ops {
> 
>   long (*ioctl)(struct kvm_device *dev, unsigned int ioctl,
> unsigned long arg);
>   };
> 
> So a KVM device for the XIVE interrupt controller can implement a couple 
> of extra calls for its need, like getting the VMA addresses, etc
> 
> > For pass-through, it's trickier, we don't want to mmap each irqfd
> > individually for the above reason, so we want to "link" them to KVM. We
> > don't want to allow qemu to take control of any arbitrary interrupt in
> > the system though, so it has to related to the ownership of the irqfd
> > coming from vfio.
> > 
> > OpenCAPI I suspect will be its own can of worms...
> > 
> > Also, have we decided how the process of switching between XICS and
> > XIVE will work vs. CAS ? 
> 
> That's how it is described in the architecture. The current choice is
> to create both XICS and XIVE objects and choose at CAS which one to
> use. It relies today on the capability of the pseries machine to 
> allocate IRQ numbers for both interrupt controller backends. These
> patches have been merged in QEMU.
> 
> A change of interrupt mode results in a reset. The device tree is 
> populated accordingly and the ICPs are switched for the model in 
> use. 

For KVM we need to only instanciate one of them though.

> > And how that will interact with KVM ? 
> 
> I expect we will do the same, which is to create two KVM devices to 
> be able to handle both interrupt controller backends depending on the 
> mode negotiated by the guest.  

That will be an ungodly mess, I'd rather we only instanciate the right
one.

> > I was
> > thinking the kernel would implement a different KVM device type, ie
> > the "emulated XICS" would remain KVM_DEV_TYPE_XICS and XIVE would be
> > KVM_DEV_TYPE_XIVE.
> 
> yes. it makes sense. The new device will have a lot in common with the 
> KVM_DEV_TYPE_XICS using kvm_xive_ops.

Ben.




Re: [Qemu-devel] [PATCH v2 02/19] spapr: introduce a skeleton for the XIVE interrupt controller

2017-12-20 Thread Benjamin Herrenschmidt
On Wed, 2017-12-20 at 16:09 +1100, David Gibson wrote:
> 
> As you've suggested in yourself, I think we might need to more
> explicitly model the different components of the XIVE system.  As part
> of that, I think you need to be clearer in this base skeleton about
> exactly what component your XIVE object represents.
> 
> If the answer is "the overall thing" I suspect that's not what you
> want - I had one of those for XICs which proved to be a mistake
> (eventually replaced by the XICSFabric interface).
> 
> Changing the model later isn't impossible, but doing so without
> breaking migration can be a real pain, so I think it's worth a
> reasonable effort to try and get it right initially.

Note: we do need to speed things up a bit, as having exploitation mode
in KVM will significantly help with IPI performance among other things.

I'm about ready to do the KVM bits. The one thing we need to discuss
and figure a good design for is how we map all those interrupt control
pages into qemu.

Each interrupt (either PCIe pass-through or the "generic XIVE IPIs"
which are used for guest IPIs and for vio/virtio/emulated interrupts)
comes with a "control page" (ESB page) which needs to be mapped into
the guest, and the generic IPIs also come with a trigger page which
needs to be mapped into the guest for guest IPIs or OpenCAPI
interrupts, or just qemu for emulated devices.

Now that can be thousands of these critters. I certainly don't want to
create thousands of VMAs in qemu and even less thousands of memory
regions in KVM.

So we need some kind of mechanism by wich a single large VMA gets
mmap'ed into qemu (or maybe a couple of these, but not too many) and
the interrupt pages can be assigned to slots in there and demand
faulted.

For the generic interrupts, this can probably be covered by KVM, adding
some arch ioctls for allocating IPIs and mmap'ing that region etc...

For pass-through, it's trickier, we don't want to mmap each irqfd
individually for the above reason, so we want to "link" them to KVM. We
don't want to allow qemu to take control of any arbitrary interrupt in
the system though, so it has to related to the ownership of the irqfd
coming from vfio.

OpenCAPI I suspect will be its own can of worms...

Also, have we decided how the process of switching between XICS and
XIVE will work vs. CAS ? And how that will interact with KVM ? I was
thinking the kernel would implement a different KVM device type, ie
the "emulated XICS" would remain KVM_DEV_TYPE_XICS and XIVE would be
KVM_DEV_TYPE_XIVE.

Cheers,
Ben.





Re: [Qemu-devel] [PATCH 0/6] spapr: Add optional capabilities

2017-12-20 Thread Benjamin Herrenschmidt
On Mon, 2017-12-18 at 20:20 +1100, David Gibson wrote:
> This series is a first draft to add the notion of optional
> capabilities to the "pseries" machine type.  A default set of
> capabilities is selected based on the machine type version and
> selected cpu model, but this can be overridden with machine
> parameters.
> 
> The purpose of this is to get rid of a number of places where we
> implicitly decide what features to advertise to the guest based on
> capabilities of the host.  This is bad, because it means it's
> difficult to be certain if machines started at different ends of a
> migration really match from the guest's point of view.
> 
> By giving the user explicit control of these optional features, then
> validating that the chosen ones can be supplied on the host we make
> behaviour more predictable.
> 
> The more specific motivation for this is that POWER9 has bugs in its
> hardware transactional memory (HTM) implementation making it unsafe to
> migrate POWER8 guests to POWER9 if they use HTM

It might be worthwhile mentioning that we do plan to support TM and
migration from P8 for KVM guests at some point. Working on it ... there
are some SW workarounds involved in KVM to make it possible.

> Changes since RFC:
>  - Two preliminary patches not really related split off and already
>merged.
>  - Assorted minor fixes based on review notes
>  - Change defaults to disable HTM on pseries-2.12 machine type
> 
> David Gibson (6):
>   spapr: Capabilities infrastructure
>   spapr: Treat Hardware Transactional Memory (HTM) as an optional
> capability
>   spapr: Validate capabilities on migration
>   target/ppc: Clean up probing of VMX, VSX and DFP availability on KVM
>   spapr: Handle VMX/VSX presence as an spapr capability flag
>   spapr: Handle Decimal Floating Point (DFP) as an optional capability
> 
>  hw/ppc/Makefile.objs   |   2 +-
>  hw/ppc/spapr.c |  47 +--
>  hw/ppc/spapr_caps.c| 342 
> +
>  include/hw/ppc/spapr.h |  46 +++
>  target/ppc/kvm.c   |  27 +---
>  target/ppc/kvm_ppc.h   |   2 -
>  6 files changed, 429 insertions(+), 37 deletions(-)
>  create mode 100644 hw/ppc/spapr_caps.c
> 




Re: [Qemu-devel] [PATCH v2 03/19] spapr: introduce the XIVE interrupt sources

2017-12-17 Thread Benjamin Herrenschmidt
On Thu, 2017-12-14 at 16:24 +0100, Cédric Le Goater wrote:
> The API between the source and the IVRE is extremely simple :
> 
>   static void spapr_xive_irq(sPAPRXive *xive, int lisn)
> 
> The IVRE then scans its IVT, finds the EQ, and moves on to the 
> presenter.

In HW it's an MMIO store between the two units (from the source to the
IVRE notification port). I wonder in the long run if we should model
that the same way...

> So, we can keep the IVRE engine (sPAPRXive) attached directly to 
> the machine like we have today, this is good, and introduce multiple 
> XIVE source objects. The sPAPR machine would have : 
> 
>  - one for the IPIs [ 0 - nr_servers ]
>  - one generic for the devices [ 4096 -  ]
>  - one for each phb ? 
> 
> The source address in the overall ESB MMIO region would be calculated 
> from the offset of the source IRQ numbers in the IRQ number space. 
> The offset could very well be hardcoded for each device. I don't see 
> any XICS compatibility problems as we are sharing correctly the IRQ 
> number space already.
> 
> 
> I am starting this discussion because the support for XIVE in the 
> QEMU PowerNV machine will need multiple sources, just like for 
> POWER8. PnvXive will be a bit different because the IVRE tables 
> (IVT and EQDT) are in the virtual machine memory. Most of the settings 
> are done in the VM. The QEMU PowerNV machine will still have to 
> implement the triggering and the routing logic using the guest tables. 



Re: [Qemu-devel] [PATCH 19/25] spapr: add hcalls support for the XIVE interrupt mode

2017-12-06 Thread Benjamin Herrenschmidt
On Wed, 2017-12-06 at 20:20 +1100, David Gibson wrote:
> On Tue, Dec 05, 2017 at 08:50:26AM -0600, Benjamin Herrenschmidt wrote:
> > On Tue, 2017-12-05 at 18:00 +1100, David Gibson wrote:
> > > > The CPU revision. But we won't introduce XIVE exploitation mode on 
> > > > anything else than DD2.0 which has full XIVE support. Even STORE_EOI 
> > > > that we should be adding.
> > > 
> > > Hrm.  Host CPU?  That's a problem - if guest visible properties like
> > > this vary with the host CPU, migration breaks.
> > 
> > I don't think this is going to be a problem in practice. The
> > availability of trigger comes from OPAL but in practice, all virtual
> > interrupts are going to support it always,
> 
> Ok.  It still makes me nervous to derive guest visible features from
> the host.  I'd prefer to just hardwire the XIVE model to always/never
> advertise it and simply fail if that isn't workable for the host kernel.

We could fail loudly if we see an migratable interrupt that doesn't
have the flag.

Cheers,
Ben.




Re: [Qemu-devel] [PATCH 19/25] spapr: add hcalls support for the XIVE interrupt mode

2017-12-05 Thread Benjamin Herrenschmidt
On Tue, 2017-12-05 at 18:00 +1100, David Gibson wrote:
> > The CPU revision. But we won't introduce XIVE exploitation mode on 
> > anything else than DD2.0 which has full XIVE support. Even STORE_EOI 
> > that we should be adding.
> 
> Hrm.  Host CPU?  That's a problem - if guest visible properties like
> this vary with the host CPU, migration breaks.

I don't think this is going to be a problem in practice. The
availability of trigger comes from OPAL but in practice, all virtual
interrupts are going to support it always, only some of the HW
originated ones (PCIe MSIs for example and LSIs) won't. And we don't
migrate with PCIe devices passed through.

So the guest need the info, but we should be ok with migration from P9
DD2.0 onwards. Nobody sane cares about P9 DD1.0.

Any future chip will have to ensure that we don't lose that property in
HW least we lost migration, but that would break AIX too so I'm
reasonably confident the HW guys will get that right ;-)

> > 
> > > If it varies with host capabilities, that's going to be real pain for
> > > migration.
> > 
> > Yes. I am not aware of any future extension but I agree this is
> > something we need to keep an eye on.
> 
> I'm not talking about future extension, I'm meaning right now.

No, no issue right now.

Cheers,
Ben.




Re: [Qemu-devel] [PATCH 15/25] spapr: notify the CPU when the XIVE interrupt priority is more privileged

2017-12-04 Thread Benjamin Herrenschmidt
On Mon, 2017-12-04 at 12:17 +1100, David Gibson wrote:
> On Sat, Dec 02, 2017 at 08:40:58AM -0600, Benjamin Herrenschmidt wrote:
> > On Thu, 2017-11-30 at 16:00 +1100, David Gibson wrote:
> > > 
> > > >  static uint64_t spapr_xive_icp_accept(sPAPRXiveICP *icp)
> > > >  {
> > > > -return 0;
> > > > +uint8_t nsr = icp->tima_os[TM_NSR];
> > > > +
> > > > +qemu_irq_lower(icp->output);
> > > > +
> > > > +if (icp->tima_os[TM_NSR] & TM_QW1_NSR_EO) {
> > > > +uint8_t cppr = icp->tima_os[TM_PIPR];
> > > > +
> > > > +icp->tima_os[TM_CPPR] = cppr;
> > > > +
> > > > +/* Reset the pending buffer bit */
> > > > +icp->tima_os[TM_IPB] &= ~priority_to_ipb(cppr);
> > > 
> > > What if multiple irqs of the same priority were queued?
> > 
> > It's the job of the OS to handle that case by consuming from the queue
> > until it's empty. There is an MMIO the guest can use if it wants to
> > that can set the IPB bits back to 1 for a given priority. Otherwise in
> > Linux we just have a SW way to force a replay.
> 
> Ok, so "accept" is effectively saying the OS is accepting all
> interrupts from that queue, right?

It's whatever you want it to mean. It's simply a test & clear on the
prio bit. From a HW standpoint, you could have multiple queues or just
set an internal SW flag to go chck again later etc...

> 
> > 
> > > > +icp->tima_os[TM_PIPR] = ipb_to_pipr(icp->tima_os[TM_IPB]);
> > > > +
> > > > +/* Drop Exception bit for OS */
> > > > +icp->tima_os[TM_NSR] &= ~TM_QW1_NSR_EO;
> > > > +}
> > > > +
> > > > +return (nsr << 8) | icp->tima_os[TM_CPPR];
> > > > +}
> > > > +
> > > > +static void spapr_xive_icp_notify(sPAPRXiveICP *icp)
> > > > +{
> > > > +if (icp->tima_os[TM_PIPR] < icp->tima_os[TM_CPPR]) {
> > > > +icp->tima_os[TM_NSR] |= TM_QW1_NSR_EO;
> > > > +qemu_irq_raise(icp->output);
> > > > +}
> > > >  }
> > > >  
> > > >  static void spapr_xive_icp_set_cppr(sPAPRXiveICP *icp, uint8_t cppr)
> > > > @@ -51,6 +105,9 @@ static void spapr_xive_icp_set_cppr(sPAPRXiveICP 
> > > > *icp, uint8_t cppr)
> > > >  }
> > > >  
> > > >  icp->tima_os[TM_CPPR] = cppr;
> > > > +
> > > > +/* CPPR has changed, inform the ICP which might raise an exception 
> > > > */
> > > > +spapr_xive_icp_notify(icp);
> > > >  }
> > > >  
> > > >  /*
> > > > @@ -224,6 +281,8 @@ static void spapr_xive_irq(sPAPRXive *xive, int 
> > > > lisn)
> > > >  XiveEQ *eq;
> > > >  uint32_t eq_idx;
> > > >  uint8_t priority;
> > > > +uint32_t server;
> > > > +sPAPRXiveICP *icp;
> > > >  
> > > >  ive = spapr_xive_get_ive(xive, lisn);
> > > >  if (!ive || !(ive->w & IVE_VALID)) {
> > > > @@ -253,6 +312,13 @@ static void spapr_xive_irq(sPAPRXive *xive, int 
> > > > lisn)
> > > >  qemu_log_mask(LOG_UNIMP, "XIVE: !UCOND_NOTIFY not 
> > > > implemented\n");
> > > >  }
> > > >  
> > > > +server = GETFIELD(EQ_W6_NVT_INDEX, eq->w6);
> > > > +icp = spapr_xive_icp_get(xive, server);
> > > > +if (!icp) {
> > > > +qemu_log_mask(LOG_GUEST_ERROR, "XIVE: No ICP for server %d\n", 
> > > > server);
> > > > +return;
> > > > +}
> > > > +
> > > >  if (GETFIELD(EQ_W6_FORMAT_BIT, eq->w6) == 0) {
> > > >  priority = GETFIELD(EQ_W7_F0_PRIORITY, eq->w7);
> > > >  
> > > > @@ -260,9 +326,18 @@ static void spapr_xive_irq(sPAPRXive *xive, int 
> > > > lisn)
> > > >  if (priority == 0xff) {
> > > >  g_assert_not_reached();
> > > >  }
> > > > +
> > > > +/* Update the IPB (Interrupt Pending Buffer) with the priority
> > > > + * of the new notification and inform the ICP, which will
> > > > + * decide to raise the exception, or not, depending the CPPR.
> > > > + */
> > > > +icp->tima_os[TM_IPB] |= priority_to_ipb(priority);
> > > > +icp->tima_os[TM_PIPR] = ipb_to_pipr(icp->tima_os[TM_IPB]);
> > > >  } else {
> > > >  qemu_log_mask(LOG_UNIMP, "XIVE: w7 format1 not implemented\n");
> > > >  }
> > > > +
> > > > +spapr_xive_icp_notify(icp);
> > > >  }
> > > >  
> > > >  /*
> > > 
> > > 
> 
> 



Re: [Qemu-devel] [PATCH 11/25] spapr: describe the XIVE interrupt source flags

2017-12-02 Thread Benjamin Herrenschmidt
On Sat, 2017-12-02 at 15:38 +0100, Cédric Le Goater wrote:
> Hmm, yes. So, the current design for sPAPR handles all sources 
> under the same XIVE object with a global memory region for all 
> the ESBs. 
> 
> The first RFC had a mechanism to register source objects into 
> the XIVE main one, allocating the IRQs per source and mapping 
> the ESBs in the overall region. A bit like OPAL does. I then 
> simplified for the sake of clarity and merged everything under 
> the same XIVE object. 
> 
> Shall I reintroduce multiples sources support ? and provide a 
> default one for IPIs and virtual devices of the machine. 

That or you need state bits ;-)

Cheers,
Ben.




Re: [Qemu-devel] [PATCH 14/25] spapr: push the XIVE EQ data in OS event queue

2017-12-02 Thread Benjamin Herrenschmidt
On Sat, 2017-12-02 at 08:45 -0600, Benjamin Herrenschmidt wrote:
> On Fri, 2017-12-01 at 15:10 +1100, David Gibson wrote:
> > 
> > Hm, ok.  Guest endian (or at least, not definitively host-endian) data
> > in a plain uint32_t makes me uncomfortable.  Could we use char data[4]
> > instead, to make it clear it's a byte-ordered buffer, rather than a
> > number as far as the XIVE is concerned.
> > 
> > Hm.. except that doesn't quite work, because the hardware must define
> > which end that generation bit ends up in...
> 
> It also needs to be written atomically. Just say it's big endian.

Also the guest reads it using be32_to_cpup...

> 
> Cheers,
> Ben.
> 
> > > > > +qemu_log_mask(LOG_GUEST_ERROR, "%s: failed to write EQ data 
> > > > > @0x%"
> > > > > +  HWADDR_PRIx "\n", __func__, qaddr);
> > > > > +return;
> > > > > +}
> > > > > +
> > > > > +qindex = (qindex + 1) % qentries;
> > > > > +if (qindex == 0) {
> > > > > +qgen ^= 1;
> > > > > +eq->w1 = SETFIELD(EQ_W1_GENERATION, eq->w1, qgen);
> > > > > +}
> > > > > +eq->w1 = SETFIELD(EQ_W1_PAGE_OFF, eq->w1, qindex);
> > > > > +}
> > > > > +
> > > > >  static void spapr_xive_irq(sPAPRXive *xive, int lisn)
> > > > >  {
> > > > > +XiveIVE *ive;
> > > > > +XiveEQ *eq;
> > > > > +uint32_t eq_idx;
> > > > > +uint8_t priority;
> > > > > +
> > > > > +ive = spapr_xive_get_ive(xive, lisn);
> > > > > +if (!ive || !(ive->w & IVE_VALID)) {
> > > > > +qemu_log_mask(LOG_GUEST_ERROR, "XIVE: invalid LISN %d\n", 
> > > > > lisn);
> > > > 
> > > > As mentioned on other patches, I'm a little concerned by these
> > > > guest-triggerable logs.  I guess the LOG_GUEST_ERROR mask will save
> > > > us, though.
> > > 
> > > I want to track 'invalid' interrupts but I haven't seen these show up 
> > > in my tests. I agree there are a little too much and some could just 
> > > be asserts.
> > 
> > Uh.. I don't think many can be assert()s.  assert() is only
> > appropriate if it being tripped definitely indicates a bug in qemu.
> > Nearly all these qemu_log()s I've seen can be tripped by the guest
> > doing something bad, which absolutely should not assert() qemu.




Re: [Qemu-devel] [PATCH 14/25] spapr: push the XIVE EQ data in OS event queue

2017-12-02 Thread Benjamin Herrenschmidt
On Fri, 2017-12-01 at 15:10 +1100, David Gibson wrote:
> 
> Hm, ok.  Guest endian (or at least, not definitively host-endian) data
> in a plain uint32_t makes me uncomfortable.  Could we use char data[4]
> instead, to make it clear it's a byte-ordered buffer, rather than a
> number as far as the XIVE is concerned.
> 
> Hm.. except that doesn't quite work, because the hardware must define
> which end that generation bit ends up in...

It also needs to be written atomically. Just say it's big endian.

Cheers,
Ben.

> > >> +qemu_log_mask(LOG_GUEST_ERROR, "%s: failed to write EQ data 
> > >> @0x%"
> > >> +  HWADDR_PRIx "\n", __func__, qaddr);
> > >> +return;
> > >> +}
> > >> +
> > >> +qindex = (qindex + 1) % qentries;
> > >> +if (qindex == 0) {
> > >> +qgen ^= 1;
> > >> +eq->w1 = SETFIELD(EQ_W1_GENERATION, eq->w1, qgen);
> > >> +}
> > >> +eq->w1 = SETFIELD(EQ_W1_PAGE_OFF, eq->w1, qindex);
> > >> +}
> > >> +
> > >>  static void spapr_xive_irq(sPAPRXive *xive, int lisn)
> > >>  {
> > >> +XiveIVE *ive;
> > >> +XiveEQ *eq;
> > >> +uint32_t eq_idx;
> > >> +uint8_t priority;
> > >> +
> > >> +ive = spapr_xive_get_ive(xive, lisn);
> > >> +if (!ive || !(ive->w & IVE_VALID)) {
> > >> +qemu_log_mask(LOG_GUEST_ERROR, "XIVE: invalid LISN %d\n", lisn);
> > > 
> > > As mentioned on other patches, I'm a little concerned by these
> > > guest-triggerable logs.  I guess the LOG_GUEST_ERROR mask will save
> > > us, though.
> > 
> > I want to track 'invalid' interrupts but I haven't seen these show up 
> > in my tests. I agree there are a little too much and some could just 
> > be asserts.
> 
> Uh.. I don't think many can be assert()s.  assert() is only
> appropriate if it being tripped definitely indicates a bug in qemu.
> Nearly all these qemu_log()s I've seen can be tripped by the guest
> doing something bad, which absolutely should not assert() qemu.



Re: [Qemu-devel] [PATCH 13/25] spapr: introduce the XIVE Event Queues

2017-12-02 Thread Benjamin Herrenschmidt
On Sat, 2017-12-02 at 08:39 -0600, Benjamin Herrenschmidt wrote:
> The IVEs and EQs are managed by the virtualization controller. The VPs
> (aka ENDs) 

typo. aka NVTs

> are managed by the presentation controller. There's a VP per
> real and virtual CPU.




Re: [Qemu-devel] [PATCH 15/25] spapr: notify the CPU when the XIVE interrupt priority is more privileged

2017-12-02 Thread Benjamin Herrenschmidt
On Thu, 2017-11-30 at 16:00 +1100, David Gibson wrote:
> 
> >  static uint64_t spapr_xive_icp_accept(sPAPRXiveICP *icp)
> >  {
> > -return 0;
> > +uint8_t nsr = icp->tima_os[TM_NSR];
> > +
> > +qemu_irq_lower(icp->output);
> > +
> > +if (icp->tima_os[TM_NSR] & TM_QW1_NSR_EO) {
> > +uint8_t cppr = icp->tima_os[TM_PIPR];
> > +
> > +icp->tima_os[TM_CPPR] = cppr;
> > +
> > +/* Reset the pending buffer bit */
> > +icp->tima_os[TM_IPB] &= ~priority_to_ipb(cppr);
> 
> What if multiple irqs of the same priority were queued?

It's the job of the OS to handle that case by consuming from the queue
until it's empty. There is an MMIO the guest can use if it wants to
that can set the IPB bits back to 1 for a given priority. Otherwise in
Linux we just have a SW way to force a replay.

> > +icp->tima_os[TM_PIPR] = ipb_to_pipr(icp->tima_os[TM_IPB]);
> > +
> > +/* Drop Exception bit for OS */
> > +icp->tima_os[TM_NSR] &= ~TM_QW1_NSR_EO;
> > +}
> > +
> > +return (nsr << 8) | icp->tima_os[TM_CPPR];
> > +}
> > +
> > +static void spapr_xive_icp_notify(sPAPRXiveICP *icp)
> > +{
> > +if (icp->tima_os[TM_PIPR] < icp->tima_os[TM_CPPR]) {
> > +icp->tima_os[TM_NSR] |= TM_QW1_NSR_EO;
> > +qemu_irq_raise(icp->output);
> > +}
> >  }
> >  
> >  static void spapr_xive_icp_set_cppr(sPAPRXiveICP *icp, uint8_t cppr)
> > @@ -51,6 +105,9 @@ static void spapr_xive_icp_set_cppr(sPAPRXiveICP *icp, 
> > uint8_t cppr)
> >  }
> >  
> >  icp->tima_os[TM_CPPR] = cppr;
> > +
> > +/* CPPR has changed, inform the ICP which might raise an exception */
> > +spapr_xive_icp_notify(icp);
> >  }
> >  
> >  /*
> > @@ -224,6 +281,8 @@ static void spapr_xive_irq(sPAPRXive *xive, int lisn)
> >  XiveEQ *eq;
> >  uint32_t eq_idx;
> >  uint8_t priority;
> > +uint32_t server;
> > +sPAPRXiveICP *icp;
> >  
> >  ive = spapr_xive_get_ive(xive, lisn);
> >  if (!ive || !(ive->w & IVE_VALID)) {
> > @@ -253,6 +312,13 @@ static void spapr_xive_irq(sPAPRXive *xive, int lisn)
> >  qemu_log_mask(LOG_UNIMP, "XIVE: !UCOND_NOTIFY not implemented\n");
> >  }
> >  
> > +server = GETFIELD(EQ_W6_NVT_INDEX, eq->w6);
> > +icp = spapr_xive_icp_get(xive, server);
> > +if (!icp) {
> > +qemu_log_mask(LOG_GUEST_ERROR, "XIVE: No ICP for server %d\n", 
> > server);
> > +return;
> > +}
> > +
> >  if (GETFIELD(EQ_W6_FORMAT_BIT, eq->w6) == 0) {
> >  priority = GETFIELD(EQ_W7_F0_PRIORITY, eq->w7);
> >  
> > @@ -260,9 +326,18 @@ static void spapr_xive_irq(sPAPRXive *xive, int lisn)
> >  if (priority == 0xff) {
> >  g_assert_not_reached();
> >  }
> > +
> > +/* Update the IPB (Interrupt Pending Buffer) with the priority
> > + * of the new notification and inform the ICP, which will
> > + * decide to raise the exception, or not, depending the CPPR.
> > + */
> > +icp->tima_os[TM_IPB] |= priority_to_ipb(priority);
> > +icp->tima_os[TM_PIPR] = ipb_to_pipr(icp->tima_os[TM_IPB]);
> >  } else {
> >  qemu_log_mask(LOG_UNIMP, "XIVE: w7 format1 not implemented\n");
> >  }
> > +
> > +spapr_xive_icp_notify(icp);
> >  }
> >  
> >  /*
> 
> 



Re: [Qemu-devel] [PATCH 13/25] spapr: introduce the XIVE Event Queues

2017-12-02 Thread Benjamin Herrenschmidt
On Thu, 2017-11-30 at 15:38 +1100, David Gibson wrote:
> On Thu, Nov 23, 2017 at 02:29:43PM +0100, Cédric Le Goater wrote:
> > The Event Queue Descriptor (EQD) table, also known as Event Notification
> > Descriptor (END), is one of the internal tables the XIVE interrupt
> > controller uses to redirect exception from event sources to CPU
> > threads.
> > 
> > The EQD specifies on which Event Queue the event data should be posted
> > when an exception occurs (later on pulled by the OS) and which server
> > (VPD in XIVE terminology) to notify. The Event Queue is a much more
> > complex structure but we start with a simple model for the sPAPR
> > machine.
> 
> Just to clarify my understanding a server / VPD in XIVE would
> typically correspond to a cpu - either real or virtual, yes?

The IVEs and EQs are managed by the virtualization controller. The VPs
(aka ENDs) are managed by the presentation controller. There's a VP per
real and virtual CPU.

You can think of the XIVE as having 3 main component types:


 - Source controller(s). There are some in the PHBs, one generic in the
XIVE itself, and one in the PSI bridge. Those contain the PQ bits and
thus the trigger & coalescing logic. They effectively shoot an MMIO to
the virtualization controller on events.

 - Virtualization controller (one per chip). This receives the above
MMIOs from the sources, manages the IVEs to get the target queue and
remap the number, and manages the queues. When a queue is enabled for
notification (or escalation) and such an event occurs, an MMIO goes to
the corresponding presentation controller.

 - Presentation controller (one per chip). This receives the above
notifications and sets as a result the IPB bits for one of the 8
priorities. Basically this guy tracks a single pending bit per priority
for each VP indicating whether there's something in the queue for that
priority and delivers interrupts to the core accordingly.


Now this is a simplified view. The PC supports groups but we don't
handle that yet, there are escalation interrupts, there are
redistribution mechanisms etc... but for now you get the basic idea.

Cheers,
Ben.




Re: [Qemu-devel] [PATCH 10/25] spapr: add MMIO handlers for the XIVE interrupt sources

2017-12-02 Thread Benjamin Herrenschmidt
On Thu, 2017-11-30 at 15:28 +1100, David Gibson wrote:
> 
> How does this work at the hardware level?  Presumbly the actual
> hardware components don't communicate with the XIVE to request edge or
> level.  So how does it know?  Specific ranges for LSIs?  If that we
> should probably do the same.

So the source controller and the IVE are separate. The source
controller sends an internal MMIO to the IVE for "translating" the
event into a queue etc...

The IVE only see "events" which are effectively state transitions of
the P bit of the source.

The LSI vs MSI difference is thus entirely a property of the source
HW.

All the XIVE "Generic" built-in sources (the ones you can trigger with
an MMIO, which we use in KVM for all the IPIs and virtual interrupts)
are MSIs.

You find 2 kind of blocks of LSIs in the chip, the one PSI block which
has a handful or two of LSI sources for random "stuff" (LPC
interrupt(s), i2c interrupts etc..) and the LSI blocks which are in
each PHB.

So the PHB has basically two different bits of logic, one for LSIs and
one for MSIs. Their HW state machine is different.

In fact in the PHB and the PSI, I think, there's even an MMIO backdoor
register that allows you to see the "state" of the LSI (asserted).

Cheers,
Ben.




Re: [Qemu-devel] [PATCH 10/25] spapr: add MMIO handlers for the XIVE interrupt sources

2017-12-02 Thread Benjamin Herrenschmidt
On Wed, 2017-11-29 at 17:23 +0100, Cédric Le Goater wrote:
> On 11/29/2017 02:56 PM, Cédric Le Goater wrote:
> > > > > > +switch (offset) {
> > > > > > +case 0:
> > > > > > +spapr_xive_source_eoi(xive, lisn);
> > > > > 
> > > > > Hrm.  I don't love that you're dealing with clearing that LSI bit
> > > > > here, but setting it at a different level.
> > > > > 
> > > > > The state machines are doing my head in a bit, is there any way
> > > > > you could derive the STATUS_SENT bit from the PQ bits?
> > > > 
> > > > Yes. I should. 
> > > > 
> > > > I am also lacking a guest driver to exercise these LSIs so I didn't
> > > > pay a lot of attention to level interrupts. Any idea ?
> > > 
> > > How about an old-school emulated PCI device?  Maybe rtl8139?
> > 
> > Perfect. The current model is working but I will see how I can 
> > improve it to use the PQ bits instead.
> 
> Using the PQ bits is simplifying the model but we still have to 
> maintain an array to store the IRQ type. 
> 
> There are 3 unused bits in the IVE descriptor, bits[1-3]:  
> 
>   #define IVE_VALID   PPC_BIT(0)
>   #define IVE_EQ_BLOCKPPC_BITMASK(4, 7)/* Destination EQ block# */
>   #define IVE_EQ_INDEXPPC_BITMASK(8, 31)   /* Destination EQ index */
>   #define IVE_MASKED  PPC_BIT(32)  /* Masked */
>   #define IVE_EQ_DATA PPC_BITMASK(33, 63)  /* Data written to the EQ 
> */
> 
> We could hijack one of them to store the LSI type and get rid of 
> the type array. Would you object to that ? 

This won't work well if/when you implement a real HW XIVE.

Another option is to have different source objects for LSIs and MSIs.

Cheers,
Ben.
> 
> C.



Re: [Qemu-devel] [PATCH 09/25] spapr: introduce handlers for XIVE interrupt sources

2017-12-02 Thread Benjamin Herrenschmidt
On Tue, 2017-11-28 at 18:18 +, Cédric Le Goater wrote:
> AFAICT, it doesn't. LSI events are configured as the other XIVE interrupts. 
> The level is converted in the P bit and the Q bit should always be zero.
> So I should be able to simplify the proposed model which still is mimicking 
> XICS  ... I will take a look at it. 
> 
> There are a sort of special degenerated LSIs but these are for bringup.

Not really. So for MSIs you don't need your state flags.

For LSIs, you do need the "asserted" one that keeps track of the LSI
input state. See my other note with the actual states.

Cheers,
Ben.




Re: [Qemu-devel] [PATCH 11/25] spapr: describe the XIVE interrupt source flags

2017-12-02 Thread Benjamin Herrenschmidt
On Tue, 2017-11-28 at 17:40 +1100, David Gibson wrote:
> > @@ -368,6 +368,10 @@ static void spapr_xive_realize(DeviceState *dev, Error 
> > **errp)
> >  /* Allocate the IVT (Interrupt Virtualization Table) */
> >  xive->ivt = g_malloc0(xive->nr_irqs * sizeof(XiveIVE));
> >  
> > +/* All sources are emulated under the XIVE object and share the
> > + * same characteristic */
> > +xive->flags = XIVE_SRC_TRIGGER;
> 
> You never actually use this field.  And since it always has the same
> value, is there a point to storing it?

Some HW sources don't have it, so with pass-through maybe...

Cheers,
Ben




Re: [Qemu-devel] [PATCH 10/25] spapr: add MMIO handlers for the XIVE interrupt sources

2017-12-02 Thread Benjamin Herrenschmidt
On Tue, 2017-11-28 at 17:38 +1100, David Gibson wrote:
> Hrm.  I don't love that you're dealing with clearing that LSI bit
> here, but setting it at a different level.
> 
> The state machines are doing my head in a bit, is there any way
> you could derive the STATUS_SENT bit from the PQ bits?

Yeah it should be...

So you should normally need only one extra bit of state for LSI which
is whether it's asserted or not and no extra bit of state for MSIs.

P is basically "sent". Q is whether another event has been queued up
(and is only meaningful for MSIs though the 01 combination will mask
LSIs too).

The state logic should be for MSIs on event:

- if PQ=01 ignore (masked)
- if P=1, set Q and finish
- set P=1 and forward event to IVE

For EOI (load and store):

- if PQ=01 ignore
- P=Q, Q=0
- (storeEOI only) if new P=1, forward event to IVE
 
For LSIs, and "event" is whenever the state is asserted, and Q is
meaningless, so basically on every change of state or ESB:

- if PQ=01 ignore (masked)
- if P=1 finish
- set P=1 and forward event to IVE

For EOI (load and store):

- if PQ=01 ignore
- clear P
- re-evaluate as above if asserted

Cheers,
Ben.




Re: [Qemu-devel] [PATCH 13/25] spapr: introduce the XIVE Event Queues

2017-11-26 Thread Benjamin Herrenschmidt
On Fri, 2017-11-24 at 09:15 +0100, Cédric Le Goater wrote:
> So The Linux driver is expected to choose priority 6. The priority
> validity is then checked in each hcall returning H_P4/H_P3 in case of 
> failure.  
> 
> But it is true that we scale the arrays with :
>  
> #define XIVE_PRIORITY_MAX  7
> 
> Do you want QEMU to completely remove prio 7 ? 

I'd like qemu to be consistent, at least make sure it errors out if the
OS tries to configue prio 7 or route an irq to it.

Cheers,
Ben.



Re: [Qemu-devel] [PATCH 13/25] spapr: introduce the XIVE Event Queues

2017-11-23 Thread Benjamin Herrenschmidt
On Thu, 2017-11-23 at 14:29 +0100, Cédric Le Goater wrote:
> The Event Queue Descriptor (EQD) table, also known as Event Notification
> Descriptor (END), is one of the internal tables the XIVE interrupt
> controller uses to redirect exception from event sources to CPU
> threads.

Keep in mind tha we want to only expose to the guest priorities 0..6
as priority 7 will not be available with KVM on DD2.0 chips.

> The EQD specifies on which Event Queue the event data should be posted
> when an exception occurs (later on pulled by the OS) and which server
> (VPD in XIVE terminology) to notify. The Event Queue is a much more
> complex structure but we start with a simple model for the sPAPR
> machine.
> 
> There is one XiveEQ per priority and the model chooses to store them
> under the Xive Interrupt presenter model. It will be retrieved, just
> like for XICS, through the 'intc' object pointer of the CPU.
> 
> The EQ indexing follows a simple pattern:
> 
>(server << 3) | (priority & 0x7)
> 
> Signed-off-by: Cédric Le Goater 
> ---
>  hw/intc/spapr_xive.c| 56 
> +
>  hw/intc/xive-internal.h | 50 +++
>  2 files changed, 106 insertions(+)
> 
> diff --git a/hw/intc/spapr_xive.c b/hw/intc/spapr_xive.c
> index 554b25e0884c..983317a6b3f6 100644
> --- a/hw/intc/spapr_xive.c
> +++ b/hw/intc/spapr_xive.c
> @@ -23,6 +23,7 @@
>  #include "sysemu/dma.h"
>  #include "monitor/monitor.h"
>  #include "hw/ppc/spapr_xive.h"
> +#include "hw/ppc/spapr.h"
>  #include "hw/ppc/xics.h"
>  
>  #include "xive-internal.h"
> @@ -34,6 +35,8 @@ struct sPAPRXiveICP {
>  uint8_t   tima[TM_RING_COUNT * 0x10];
>  uint8_t   *tima_os;
>  qemu_irq  output;
> +
> +XiveEQeqt[XIVE_PRIORITY_MAX + 1];
>  };
>  
>  static uint64_t spapr_xive_icp_accept(sPAPRXiveICP *icp)
> @@ -183,6 +186,13 @@ static const MemoryRegionOps spapr_xive_tm_ops = {
>  },
>  };
>  
> +static sPAPRXiveICP *spapr_xive_icp_get(sPAPRXive *xive, int server)
> +{
> +PowerPCCPU *cpu = spapr_find_cpu(server);
> +
> +return cpu ? SPAPR_XIVE_ICP(cpu->intc) : NULL;
> +}
> +
>  static void spapr_xive_irq(sPAPRXive *xive, int lisn)
>  {
>  
> @@ -632,6 +642,8 @@ static void spapr_xive_icp_reset(void *dev)
>  sPAPRXiveICP *xicp = SPAPR_XIVE_ICP(dev);
>  
>  memset(xicp->tima, 0, sizeof(xicp->tima));
> +
> +memset(xicp->eqt, 0, sizeof(xicp->eqt));
>  }
>  
>  static void spapr_xive_icp_realize(DeviceState *dev, Error **errp)
> @@ -683,6 +695,23 @@ static void spapr_xive_icp_init(Object *obj)
>  xicp->tima_os = >tima[TM_QW1_OS];
>  }
>  
> +static const VMStateDescription vmstate_spapr_xive_icp_eq = {
> +.name = TYPE_SPAPR_XIVE_ICP "/eq",
> +.version_id = 1,
> +.minimum_version_id = 1,
> +.fields = (VMStateField []) {
> +VMSTATE_UINT32(w0, XiveEQ),
> +VMSTATE_UINT32(w1, XiveEQ),
> +VMSTATE_UINT32(w2, XiveEQ),
> +VMSTATE_UINT32(w3, XiveEQ),
> +VMSTATE_UINT32(w4, XiveEQ),
> +VMSTATE_UINT32(w5, XiveEQ),
> +VMSTATE_UINT32(w6, XiveEQ),
> +VMSTATE_UINT32(w7, XiveEQ),
> +VMSTATE_END_OF_LIST()
> +},
> +};
> +
>  static bool vmstate_spapr_xive_icp_needed(void *opaque)
>  {
>  /* TODO check machine XIVE support */
> @@ -696,6 +725,8 @@ static const VMStateDescription vmstate_spapr_xive_icp = {
>  .needed = vmstate_spapr_xive_icp_needed,
>  .fields = (VMStateField[]) {
>  VMSTATE_BUFFER(tima, sPAPRXiveICP),
> +VMSTATE_STRUCT_ARRAY(eqt, sPAPRXiveICP, (XIVE_PRIORITY_MAX + 1), 1,
> + vmstate_spapr_xive_icp_eq, XiveEQ),
>  VMSTATE_END_OF_LIST()
>  },
>  };
> @@ -755,3 +786,28 @@ bool spapr_xive_irq_unset(sPAPRXive *xive, uint32_t lisn)
>  ive->w &= ~IVE_VALID;
>  return true;
>  }
> +
> +/*
> + * Use a simple indexing for the EQs.
> + */
> +XiveEQ *spapr_xive_get_eq(sPAPRXive *xive, uint32_t eq_idx)
> +{
> +int priority = eq_idx & 0x7;
> +sPAPRXiveICP *xicp = spapr_xive_icp_get(xive, eq_idx >> 3);
> +
> +return xicp ? >eqt[priority] : NULL;
> +}
> +
> +bool spapr_xive_eq_for_server(sPAPRXive *xive, uint32_t server,
> +  uint8_t priority, uint32_t *out_eq_idx)
> +{
> +if (priority > XIVE_PRIORITY_MAX) {
> +return false;
> +}
> +
> +if (out_eq_idx) {
> +*out_eq_idx = (server << 3) | (priority & 0x7);
> +}
> +
> +return true;
> +}
> diff --git a/hw/intc/xive-internal.h b/hw/intc/xive-internal.h
> index 7d329f203a9b..c3949671aa03 100644
> --- a/hw/intc/xive-internal.h
> +++ b/hw/intc/xive-internal.h
> @@ -131,9 +131,59 @@ typedef struct XiveIVE {
>  #define IVE_EQ_DATA PPC_BITMASK(33, 63)  /* Data written to the EQ */
>  } XiveIVE;
>  
> +/* EQ */
> +typedef struct XiveEQ {
> +uint32_tw0;
> +#define EQ_W0_VALID PPC_BIT32(0)
> +#define EQ_W0_ENQUEUE   PPC_BIT32(1)
> 

Re: [Qemu-devel] [PATCH v2 2/4] spapr/rtas: disable the decrementer interrupt when a CPU is unplugged

2017-10-10 Thread Benjamin Herrenschmidt
On Mon, 2017-10-09 at 17:49 +0200, Cédric Le Goater wrote:
> When a CPU is stopped with the 'stop-self' RTAS call, its state
> 'halted' is switched to 1 and, in this case, the MSR is not taken into
> account anymore in the cpu_has_work() routine. Only the pending
> hardware interrupts are checked with their LPCR:PECE* enablement bit.
> 
> If the DECR timer fires after 'stop-self' is called and before the CPU
> 'stop' state is reached, the nearly-dead CPU will have some work to do
> and the guest will crash. This case happens very frequently with the
> not yet upstream P9 XIVE exploitation mode. In XICS mode, the DECR is
> occasionally fired but after 'stop' state, so no work is to be done
> and the guest survives.
> 
> I suspect there is a race between the QEMU mainloop triggering the
> timers and the TCG CPU thread but I could not quite identify the root
> cause. To be safe, let's disable the decrementer interrupt in the LPCR
> when the CPU is halted and reenable it when the CPU is restarted.
> 
> Signed-off-by: Cédric Le Goater 

We should disable external interrupts and doorbells too no ? IE, we
could clear all of PECE in fact.

> ---
> 
> Changes in v2:
> 
>  - used a new routine ppc_cpu_pvr_match() to discriminate CPU versions
>  - removed the LPCR:PECE* enablement bit when the CPU is initialized
>if it is a secondary
> 
>  hw/ppc/spapr_rtas.c | 20 
>  target/ppc/translate_init.c | 19 +--
>  2 files changed, 37 insertions(+), 2 deletions(-)
> 
> diff --git a/hw/ppc/spapr_rtas.c b/hw/ppc/spapr_rtas.c
> index cdf0b607a0a0..dfdbf1e2c6f8 100644
> --- a/hw/ppc/spapr_rtas.c
> +++ b/hw/ppc/spapr_rtas.c
> @@ -46,6 +46,7 @@
>  #include "qemu/cutils.h"
>  #include "trace.h"
>  #include "hw/ppc/fdt.h"
> +#include "target/ppc/cpu-models.h"
>  
>  static void rtas_display_character(PowerPCCPU *cpu, sPAPRMachineState *spapr,
> uint32_t token, uint32_t nargs,
> @@ -174,6 +175,15 @@ static void rtas_start_cpu(PowerPCCPU *cpu_, 
> sPAPRMachineState *spapr,
>  kvm_cpu_synchronize_state(cs);
>  
>  env->msr = (1ULL << MSR_SF) | (1ULL << MSR_ME);
> +
> +/* Enable DECR interrupt */
> +if (ppc_cpu_pvr_match(cpu, CPU_POWERPC_LOGICAL_3_00)) {
> +env->spr[SPR_LPCR] |= LPCR_DEE;
> +} else {
> +/* P7 and P8 both have same bit for DECR */
> +env->spr[SPR_LPCR] |= LPCR_P8_PECE3;
> +}
> +
>  env->nip = start;
>  env->gpr[3] = r3;
>  cs->halted = 0;
> @@ -210,6 +220,16 @@ static void rtas_stop_self(PowerPCCPU *cpu, 
> sPAPRMachineState *spapr,
>   * no need to bother with specific bits, we just clear it.
>   */
>  env->msr = 0;
> +
> +/* Don't let the decremeter run on a CPU being stopped. This could
> + * deliver an interrupt on a dying CPU and crash the guest.
> + */
> +if (ppc_cpu_pvr_match(cpu, CPU_POWERPC_LOGICAL_3_00)) {
> +env->spr[SPR_LPCR] &= ~LPCR_DEE;
> +} else {
> +/* P7 and P8 both have same bit for DECR */
> +env->spr[SPR_LPCR] &= ~LPCR_P8_PECE3;
> +}
>  }
>  
>  static inline int sysparm_st(target_ulong addr, target_ulong len,
> diff --git a/target/ppc/translate_init.c b/target/ppc/translate_init.c
> index 0d6379fcc5b4..1a62159843e7 100644
> --- a/target/ppc/translate_init.c
> +++ b/target/ppc/translate_init.c
> @@ -8905,6 +8905,7 @@ void cpu_ppc_set_papr(PowerPCCPU *cpu, 
> PPCVirtualHypervisor *vhyp)
>  CPUPPCState *env = >env;
>  ppc_spr_t *lpcr = >spr_cb[SPR_LPCR];
>  ppc_spr_t *amor = >spr_cb[SPR_AMOR];
> +CPUState *cs = CPU(cpu);
>  
>  cpu->vhyp = vhyp;
>  
> @@ -8946,8 +8947,15 @@ void cpu_ppc_set_papr(PowerPCCPU *cpu, 
> PPCVirtualHypervisor *vhyp)
>  } else {
>  lpcr->default_value &= ~(LPCR_UPRT | LPCR_GTSE);
>  }
> -lpcr->default_value |= LPCR_PDEE | LPCR_HDEE | LPCR_EEE | LPCR_DEE |
> +lpcr->default_value |= LPCR_PDEE | LPCR_HDEE | LPCR_EEE |
> LPCR_OEE;
> +
> +/* Only let the decremeter wake up the boot CPU. The RTAS
> + * command start-cpu will enable it on secondaries.
> + */
> +if (cs == first_cpu) {
> +lpcr->default_value |= LPCR_DEE;
> +}
>  break;
>  default:
>  /* P7 and P8 has slightly different PECE bits, mostly because P8 adds
> @@ -8955,7 +8963,14 @@ void cpu_ppc_set_papr(PowerPCCPU *cpu, 
> PPCVirtualHypervisor *vhyp)
>   * will work as expected for both implementations
>   */
>  lpcr->default_value |= LPCR_P8_PECE0 | LPCR_P8_PECE1 | LPCR_P8_PECE2 
> |
> -   LPCR_P8_PECE3 | LPCR_P8_PECE4;
> +   LPCR_P8_PECE4;
> +
> +/* Only let the decremeter wake up the boot CPU. The RTAS
> + * command start-cpu will enable it on secondaries.
> + */
> +if (cs == 

Re: [Qemu-devel] [PATCH 1/2] spapr/rtas: disable the decrementer interrupt when a CPU is unplugged

2017-10-06 Thread Benjamin Herrenschmidt
On Fri, 2017-10-06 at 20:07 +1100, David Gibson wrote:
> Hm.  Checking mmu_model doesn't seem right to me.  I mean, it'll get
> the right answer in practice, but the LPCR programming has nothing
> whatsoever to do with the MMU.
> 
> I think explicitly checking if cpu_ is a POWER9 instance with
> object_dynamic_cast would be a better option.

Best is ARCH 300 ... do we hvae arch versions outside of MMU model
these days ?

Ben.




Re: [Qemu-devel] [PATCH 0/2] disable the decrementer interrupt when a CPU is unplugged

2017-10-06 Thread Benjamin Herrenschmidt
On Fri, 2017-10-06 at 11:40 +0530, Nikunj A Dadhania wrote:
> Cédric Le Goater  writes:
> 
> > Hello,
> > 
> > When a CPU is stopped with the 'stop-self' RTAS call, its state
> > 'halted' is switched to 1 and, in this case, the MSR is not taken into
> > account anymore in the cpu_has_work() routine. Only the pending
> > hardware interrupts are checked with their LPCR:PECE* enablement bit.
> > 
> > If the DECR timer fires after 'stop-self' is called and before the CPU
> > 'stop' state is reached, the nearly-dead CPU will have some work to do
> > and the guest will crash. This case happens very frequently with the
> > not yet upstream P9 XIVE exploitation mode. In XICS mode, the DECR is
> > occasionally fired but after 'stop' state, so no work is to be done
> > and the guest survives.
> > 
> > I suspect there is a race between the QEMU mainloop triggering the
> > timers and the TCG CPU thread but I could not quite identify the root
> > cause. To be safe, let's disable the decrementer interrupt in the LPCR
> > when the CPU is halted and reenable it when the CPU is restarted.
> 
> Moreover, disabling the DECR in the reset path solves the TCG multi cpu
> reboot case, as reboot path does not call stop-cpu rtas call.

SHouldn't we do it in set_papr too and only turn it on for the boot CPU
and in start-cpu RTAS call ? Same with the other PECEs in fact...

> diff --git a/hw/ppc/spapr_cpu_core.c b/hw/ppc/spapr_cpu_core.c
> index 3e20b1d886..c5150ee590 100644
> --- a/hw/ppc/spapr_cpu_core.c
> +++ b/hw/ppc/spapr_cpu_core.c
> @@ -86,6 +86,15 @@ static void spapr_cpu_reset(void *opaque)
>  cs->halted = 1;
>  
>  env->spr[SPR_HIOR] = 0;
> +/* Disable DECR for secondary cpus */
> +if (cs != first_cpu) {
> +if (env->mmu_model == POWERPC_MMU_3_00) {
> +env->spr[SPR_LPCR] &= ~LPCR_DEE;
> +} else {
> +/* P7 and P8 both have same bit for DECR */
> +env->spr[SPR_LPCR] &= ~LPCR_P8_PECE3;
> +}
> +}
>  }
>  
>  static void spapr_cpu_destroy(PowerPCCPU *cpu)
> 
> 
> Regards
> Nikunj



Re: [Qemu-devel] [RFC PATCH v2 18/21] ppc/xive: add device tree support

2017-09-28 Thread Benjamin Herrenschmidt
On Thu, 2017-09-28 at 10:51 +0200, Cédric Le Goater wrote:
> probably, I just removed the properties under QEMU and could 
> boot the guest, with disks and network.

As long as you don't use LSIs...
>  
> > Do you have a DT snapshot from pHyp for me to look at ?
> 
> 
> # lsprop /proc/device-tree/interrupt-controller\@20001/
> compatible   "ibm,power-ivpe"
> device_type  "power-ivpe"
> ibm,xive-eq-sizes
>  0007 0009 000c 000e
>  0010 0012 0015 0016
>  0018
> reg  0002 0001  0001
>  0002   0001
> linux,phandle00dce438 (14476344)
> ibm,xive-lisn-ranges
>  00094000 0030
> name "interrupt-controller"
> 
> 
> Cheers,



Re: [Qemu-devel] [RFC PATCH v2 18/21] ppc/xive: add device tree support

2017-09-28 Thread Benjamin Herrenschmidt
On Thu, 2017-09-21 at 11:35 +1000, David Gibson wrote:
> > >> +_FDT(fdt_setprop_string(fdt, node, "device_type", "power-ivpe"));
> > >> +_FDT(fdt_setprop(fdt, node, "reg", timas, sizeof(timas)));
> > >> +
> > >> +_FDT(fdt_setprop_string(fdt, node, "compatible", "ibm,power-ivpe"));
> > >> +_FDT(fdt_setprop(fdt, node, "ibm,xive-eq-sizes", eq_sizes,
> > >> + sizeof(eq_sizes)));
> > >> +_FDT(fdt_setprop(fdt, node, "ibm,xive-lisn-ranges", lisn_ranges,
> > >> + sizeof(lisn_ranges)));
> > > 
> > > I note this doesn't have the interrupt-controller or #interrupt-cells
> > > properties.  So what acts as the interrupt parent for all the devices
> > > in the tree with XIVE?
> > 
> > these properties are not in the specs anymore for the interrupt-controller
> > node and I don't think Linux makes use of them (even for XICS). So 
> > it just works fine.
> 
> Um.. what!?  Are you saying that the PAPR XIVE spec completely broke
> how interrupt specifiers have worked in the device tree since forever?
> 
> And I'm pretty sure Linux does make use of them.  Without
> #interrupt-cells, there's no way it can properly interpret the
> interrupts properties in the device nodes.

Linux does make use of them and they are in the spec, but don't confuse
the nodes for the presentation controllers vs the node for the virtual
source controller which is the one that is the root of the interrupt
tree.

Cheers,
Ben.




Re: [Qemu-devel] [RFC PATCH v2 18/21] ppc/xive: add device tree support

2017-09-28 Thread Benjamin Herrenschmidt
On Thu, 2017-09-21 at 13:21 +0200, Cédric Le Goater wrote:
> Let me be more precise. I am saying that the interrupt-controller 
> and #interrupt-cells properties are not needed under the main interrupt 
> controller node. They can be removed from the tree and the Linux guest 
> kernel will boot perfectly well.

No they are needed. They are the parents of PCI interrupts for example.
There's something fishy here.

Do you have a DT snapshot from pHyp for me to look at ?

> These properties still are needed under the sub nodes like :
> 
> /proc/device-tree/vdevice/interrupt-controller
> /proc/device-tree/event-sources/interrupt-controller



Re: [Qemu-devel] [RFC PATCH v2 07/21] ppc/xive: add MMIO handlers for the XIVE interrupt sources

2017-09-28 Thread Benjamin Herrenschmidt
On Wed, 2017-09-20 at 15:05 +0200, Cédric Le Goater wrote:
> > > +/*
> > > + * XIVE Interrupt Source MMIOs
> > > + */
> > > +static uint64_t spapr_xive_esb_read(void *opaque, hwaddr addr, unsigned 
> > > size)
> > > +{
> > > +sPAPRXive *xive = SPAPR_XIVE(opaque);
> > > +uint32_t offset = addr & 0xF00;
> > > +uint32_t srcno = addr >> xive->esb_shift;
> > > +XiveIVE *ive;
> > > +uint64_t ret = -1;
> > > +
> > > +ive = spapr_xive_get_ive(xive, srcno);
> > > +if (!ive || !(ive->w & IVE_VALID))  {
> > > +qemu_log_mask(LOG_GUEST_ERROR, "XIVE: invalid LISN %d\n", srcno);
> > > +goto out;
> > 
> > Since there's a whole (4k) page for each source, I wonder if we should
> > actually map each one as a separate MMIO region to allow us to tweak
> > the mappings more flexibly
> 
> yes we could have a subregion for each source. In that case, 
> we should also handle IVE_VALID properly. That will require 
> a specific XIVE allocator which was difficult to do while
> keeping the compatibility with XICS for migration and CAS.

That will be a serious bloat with lots of interrupts. We also cannot
possibly have a KVM mm region per interrupt or even a vma.

I'm thinking of some kind of /dev/xive (or some other KVM or irqfd
orignated fd) that allows you to mmap a single big region whose content
is demand-faulted and invalidated by the kernel to map the various
interrupts.

So that it looks like a single VMA (and KVM memory block).

Ben.

> C.
> 
> 



Re: [Qemu-devel] [RFC PATCH v2 07/21] ppc/xive: add MMIO handlers for the XIVE interrupt sources

2017-09-28 Thread Benjamin Herrenschmidt
On Wed, 2017-09-20 at 14:54 +0200, Cédric Le Goater wrote:
> On 09/19/2017 04:57 AM, David Gibson wrote:
> > On Mon, Sep 11, 2017 at 07:12:21PM +0200, Cédric Le Goater wrote:
> > > Each interrupt source is associated with a two bit state machine
> > > called an Event State Buffer (ESB) which is controlled by MMIO to
> > > trigger events. See code for more details on the states and
> > > transitions.
> > > 
> > > The MMIO space for the ESB translation is 512GB large on baremetal
> > > (powernv) systems and the BAR depends on the chip id. In our model for
> > > the sPAPR machine, we choose to only map a sub memory region for the
> > > provisionned IRQ numbers and to use the mapping address of chip 0 on a
> > > real system. The OS will get the address of the MMIO page of the ESB
> > > entry associated with an IRQ using the H_INT_GET_SOURCE_INFO hcall.
> > 
> > On bare metal, are the MMIOs for each irq source mapped contiguously?
> 
> yes. 

Sort-of...

There are several source "controllers" in the system. Each PHB gets a
range of numbers, and the XIVE itself has about half a million of
generic sources (aka 'IPIs') which we use for IPIs and virtual device
interrupts.

Each of those "controller" has its own MMIO area. So the MMIOs are only
mapped contiguously within a given controller.

With pass-through, things are a bit more complex because a given guest
visible source can become either an IPI (when not attached to the host
interrupt) or a real HW source. So we'll have to invalidate the GPA-
>HVA mapping and remap. Tricky (but doable). I have some ideas about
how to plumb all that but haven't really fully thought it out.

> > > For KVM support, we should think of a way to map this QEMU memory
> > > region in the host to trigger events directly.
> > 
> > This would rely on being able to map them without mapping those for
> > any other VM or the host.  Does that mean allocating a contiguous (and
> > aligned) hunk of irqs for a guest?
> 
> I think so yes, the IRQ and the memory regions are tied, and also being 
> able to pass the MMIO region from the host to the guest, a bit like VFIO 
> for the IOMMU regions I suppose. But I haven't dig the problem too much. 
> 
> This is an important part in the overall design. 

There are also MMIO regions associated with queues.

> > We're going to need to be careful about irq allocation here.
> > Even though GET_SOURCE_INFO allows dynamic mapping of irq numbers to
> > MMIO addresses, 

> GET_SOURCE_INFO only retrieves the address of the MMIO region for 
> a 'lisn'.

An interrupt number as coming from the device-tree.

>  it is not dynamically mapped. In the KVM case, the initial
> information on the address would come from OPAL and then the host 
> kernel would translate this information for the guest.
> 
> > we need the MMIO addresses to be stable and consistent, because 
> > we can't have them change across migration.  
> 
> yes. I will catch my XIVE guru next week in Paris to clarify that
> part. 
> 
> > We need to have this consistent between in-qemu and in-KVM XIVE
> > implementations as well.
> 
> yes.
> 
> C.
> 
> > > 
> > > Signed-off-by: Cédric Le Goater 
> > > ---
> > >  hw/intc/spapr_xive.c| 255 
> > > 
> > >  include/hw/ppc/spapr_xive.h |   6 ++
> > >  2 files changed, 261 insertions(+)
> > > 
> > > diff --git a/hw/intc/spapr_xive.c b/hw/intc/spapr_xive.c
> > > index 1ed7b6a286e9..8a85d64efc4c 100644
> > > --- a/hw/intc/spapr_xive.c
> > > +++ b/hw/intc/spapr_xive.c
> > > @@ -33,6 +33,218 @@ static void spapr_xive_irq(sPAPRXive *xive, int srcno)
> > >  }
> > >  
> > >  /*
> > > + * "magic" Event State Buffer (ESB) MMIO offsets.
> > > + *
> > > + * Each interrupt source has a 2-bit state machine called ESB
> > > + * which can be controlled by MMIO. It's made of 2 bits, P and
> > > + * Q. P indicates that an interrupt is pending (has been sent
> > > + * to a queue and is waiting for an EOI). Q indicates that the
> > > + * interrupt has been triggered while pending.
> > > + *
> > > + * This acts as a coalescing mechanism in order to guarantee
> > > + * that a given interrupt only occurs at most once in a queue.
> > > + *
> > > + * When doing an EOI, the Q bit will indicate if the interrupt
> > > + * needs to be re-triggered.
> > > + *
> > > + * The following offsets into the ESB MMIO allow to read or
> > > + * manipulate the PQ bits. They must be used with an 8-bytes
> > > + * load instruction. They all return the previous state of the
> > > + * interrupt (atomically).
> > > + *
> > > + * Additionally, some ESB pages support doing an EOI via a
> > > + * store at 0 and some ESBs support doing a trigger via a
> > > + * separate trigger page.
> > > + */
> > > +#define XIVE_ESB_GET0x800
> > > +#define XIVE_ESB_SET_PQ_00  0xc00
> > > +#define XIVE_ESB_SET_PQ_01  0xd00
> > > +#define XIVE_ESB_SET_PQ_10  0xe00
> > > +#define XIVE_ESB_SET_PQ_11  0xf00
> > > +
> > > +#define 

Re: [Qemu-devel] [RFC PATCH v2 00/21] Guest exploitation of the XIVE interrupt controller (POWER9)

2017-09-28 Thread Benjamin Herrenschmidt
On Wed, 2017-09-20 at 14:33 +0200, Cédric Le Goater wrote:
> > > I'm thinking maybe trying to support the CAS negotiation of interrupt
> > > controller from day 1 is warping the design.  A better approach might
> > > be first to implement XIVE only when given a specific machine option -
> > > guest gets one or the other and can't negotiate.
> 
> ok. 
> 
> CAS is not the most complex problem, we mostly need to share 
> the ICSIRQState array and the source offset. migration from older
> machine is a problem. We are doomed to keep the existing XICS
> framework available.

I don't like sharing anything. I'd rather we had separate objects
alltogether. If needed we can implement CAS by doing a partition reboot
like pHyp does, at least initially, until we add ways to tear down and
rebuild objects.

The main issue is whether we can keep a consistent number space so the
DT doesn't have to be completely rebuilt. If it does, then reboot will
be the only practical option I'm afraid.

> > > That should allow a more natural XIVE design to emerge, *then* we can
> > > look at what's necessary to make boot-time negotiation possible.
> > 
> > Actually, it just occurred to me that we might be making life hard for
> > ourselves by trying to actually switch between full XICS and XIVE
> > models.  Coudln't we have new machine types always construct the XIVE
> > infrastructure, 
> 
> yes.
> 
> > but then implement the XICS RTAS and hcalls in terms of the XIVE virtual 
> > hardware.

That's gross :-)

This is also exactly what KVM does with real XIVE HW and there's also
such an emulation in OPAL. I'd be weary of creating a 3rd one...

I'd much prefer if we managed to:

 - Split the source numbering from the various state tracking objects
so we can have that common

 - Either delay the creation to after CAS or tear down & re-create the
state tracking objects at CAS time.

> ok but migration will not be supported.
> 
> > Since something more or less equivalent
> > has already been done in both OPAL and the host kernel, I'm guessing
> > this shouldn't be too hard at this point.

It would very much suck to have yet another one of these.

Also we need to understand how that would work in a KVM context, the
kernel will provide a "XICS" state even on top of XIVE unless we switch
the kernel object to native, but then the kernel will expect full
exploitation.

> Indeed that is how it is working currently on P9 kvm guests. hcalls are
> implemented on top of XIVE native.
> 
> Thanks,
> 
> 
> C.



Re: [Qemu-devel] [RFC PATCH v2 14/21] ppc/xive: add support for the SET_OS_PENDING command

2017-09-28 Thread Benjamin Herrenschmidt
On Wed, 2017-09-20 at 11:47 +0200, Cédric Le Goater wrote:
> > > @@ -162,7 +162,14 @@ static bool spapr_xive_tm_is_readonly(uint8_t index)
> > >   static void spapr_xive_tm_write_special(ICPState *icp, hwaddr offset,
> > > uint64_t value, unsigned size)
> > >   {
> > > -/* TODO: support TM_SPC_SET_OS_PENDING */
> > > +if (offset == TM_SPC_SET_OS_PENDING && size == 1) {
> > > +icp->tima_os[TM_IPB] |= priority_to_ipb(value & 0xff);
> > > +icp->tima_os[TM_PIPR] = ipb_to_pipr(icp->tima_os[TM_IPB]);
> > 
> > This only lets the cpu raise bits in the IPB, never clear them.> Is that 
> > right?  
> 
> The clear is done when the OS acks the interrupt.
> 
> > I don't see how you'd implement the handling of multiple
> > priorities without being able to clear bits here.
> 
> I am not sure how this command should be used from the OS. 
> Currently, I only see KVM handling it in the XICS/XIVE glue.
> I need to take a closer look.

It's a way to avoid the SW replay on EOI.

IE, assume you have 2 interrupts in the queue. You take the exception,
ack the first one, process it etc...

Then you EOI, the HW won't send a second notification. You need to look
at the queue and continue consuming until it's empty.

Today Linux checks the queue on EOI and use a SW mechanism to
synthetize a new pseudo-external interrupt.

This MMIO command would allow the OS to instead set back the
corresponding priority bit to 1 in the IPB and cause the HW to re-emit
the interrupt instead of SW.

Linux doesn't use this today because DD1 didn't support it for the HV
level, but other OSes might and we also might use it when we do groups,
thus allowing redistribution.

Cheers,
Ben.



Re: [Qemu-devel] [RFC PATCH v2 13/21] ppc/xive: handle interrupt acknowledgment by the O/S

2017-09-28 Thread Benjamin Herrenschmidt
On Wed, 2017-09-20 at 11:40 +0200, Cédric Le Goater wrote:
> > Plus, this doesn't seem right.  Shouldn't this
> > recheck the CPPR against the PIPR, in case a higher priority irq has
> > been delivered since the one the cpu is acking.
> 
> If a higher priority is delivered, it means that the CPPR was more 
> privileged and that we have now two bits set in the IPB by the time 
> the interrupt is acked. The high priority PIPR will become the new 
> CPPR and the IBP will be modified keeping only the lower priority. 
> 
> if the CPPR is modified to the lower priority level, then the 
> first interrupt will be delivered again. 
> 
> I think this is fine.

Also remember the HW PIPR behaviour, its a bit odd, it will be clamped
by the CPPR. So if CPPR is 0 PIPR will be 0.

If CPPR is 7, PIPR will be <= 7, etc...

Cheers,
Ben.




Re: [Qemu-devel] [RFC PATCH v2 11/21] ppc/xive: push the EQ data in OS event queue

2017-09-28 Thread Benjamin Herrenschmidt
On Wed, 2017-09-20 at 16:34 +1000, David Gibson wrote:
> > >> +if (GETFIELD(EQ_W6_FORMAT_BIT, eq->w6) == 0) {
> > >> +priority = GETFIELD(EQ_W7_F0_PRIORITY, eq->w7);
> > >>  
> > >> +/* The EQ is masked. Can this happen ?  */
> > >> +if (priority == 0xff) {
> > >> +return;
> > > 
> > > How does the 8-bit priority field here interact with the 3-bit
> > > priority which selects which EQ to use?
> > 
> > priority OxFF is a special case kept for masking, see the hcall 
> > h_int_set_source_config. It should never reach the EQ lookup 
> > routines. So may be an assert would be better here.
> 
> Ok, if this situation can't be guest triggered, only by a bug in the
> rest of the XIVE code, then an assert() is better.

Note: this doesn't match HW. However there's a mask bit in the EAS.

The problem when masking that way of course is that you lose triggers,
ie P gets set, the interrupt lost, and nobody will clear P.

Cheers,
Ben.




Re: [Qemu-devel] [RFC PATCH v2 01/21] ppc/xive: introduce a skeleton for the sPAPR XIVE interrupt controller

2017-09-26 Thread Benjamin Herrenschmidt
On Tue, 2017-09-26 at 13:54 +1000, David Gibson wrote:
> > > 
> > > Which ones?  My impression was that there needed to be at least #cpus
> > > * #priority-levels EQs, but there could be more than that, 
> > 
> > euh no, not in spapr mode at least. There are 8 queues per cpu.
> 
> Ok.

There's a HW feature of XIVE in DD2.x that I will start exploiting soon
that sacrifices a queue btw, keep that in mind.

We should probably only expose 0...6 to guests, not 0...7.

> > > so it was no longer as tightly bound to the number if "interrupt 
> > > servers"> as xics.
> > 
> > ah. I think I see what you mean, that we could allocate them on the 
> > fly when needed by some hcalls ?
> 
> Not at hcall time, no, but at cpu hot(un)plug time I was wondering if we
> could (de)allocate them then.
> 
> > The other place where I use the nr_targets is to provision the 
> > IRQ numbers for the IPIs but that could probably be done in some 
> > other way, specially it there is a IRQ allocator at the machine
> > level.
> 
> Hm, ok.
> 



Re: [Qemu-devel] [RFC PATCH 16/26] ppc/xive: notify CPU when interrupt priority is more privileged

2017-09-09 Thread Benjamin Herrenschmidt
On Sat, 2017-09-09 at 10:08 +0200, Cédric Le Goater wrote:
> On 09/09/2017 09:39 AM, Benjamin Herrenschmidt wrote:
> > On Wed, 2017-07-05 at 19:13 +0200, Cédric Le Goater wrote:
> > > Signed-off-by: Cédric Le Goater <c...@kaod.org>
> > > ---
> > >  hw/intc/xive.c | 21 +
> > >  1 file changed, 21 insertions(+)
> > > 
> > > diff --git a/hw/intc/xive.c b/hw/intc/xive.c
> > > index c3c1e9c9db2d..cda1fa18e44d 100644
> > > --- a/hw/intc/xive.c
> > > +++ b/hw/intc/xive.c
> > > @@ -53,6 +53,21 @@ static uint64_t xive_icp_accept(XiveICPState *xicp)
> > >  return (nsr << 8) | xicp->tima_os[TM_CPPR];
> > >  }
> > >  
> > > +static uint8_t ipb_to_pipr(uint8_t ibp)
> > > +{
> > > +return ibp ? clz32((uint32_t)ibp << 24) : 0xff;
> > > +}
> > 
> > The PIPR needs to be updated also on accept etc... anything that change
> > IPBs or CPPR really.
> 
> I did forget the update in accept ... thanks. 
> 
> > Also I just learned something from the designers: The IPIR is clamped
> > to CPPR.
> > 
> > So basically the value in the PIPR is:
> > 
> > v = leftmost_bit_of(ipb) (or 0xff);
> > pipr = v < cppr ? v : cppr;
> > 
> > which means it's never actually 0xff ... surprise !
> 
> ah. that is not what the specs say but it shouldn't be a problem.
> I will change the code to reflect that.

Right, I've asked the HW guys to update the spec.

Cheers,
Ben.

> C.
> 
> > That also means I need to fix my implementation of H_IPOLL in KVM.
> > 
> > > +static void xive_icp_notify(XiveICPState *xicp)
> > > +{
> > > +xicp->tima_os[TM_PIPR] = ipb_to_pipr(xicp->tima_os[TM_IPB]);
> > > +
> > > +if (xicp->tima_os[TM_PIPR] < xicp->tima_os[TM_CPPR]) {
> > > +xicp->tima_os[TM_NSR] |= TM_QW1_NSR_EO;
> > > +qemu_irq_raise(ICP(xicp)->output);
> > > +}
> > > +}
> > > +
> > >  static void xive_icp_set_cppr(XiveICPState *xicp, uint8_t cppr)
> > >  {
> > >  if (cppr > XIVE_PRIORITY_MAX) {
> > > @@ -60,6 +75,10 @@ static void xive_icp_set_cppr(XiveICPState *xicp, 
> > > uint8_t cppr)
> > >  }
> > >  
> > >  xicp->tima_os[TM_CPPR] = cppr;
> > > +
> > > +/* CPPR has changed, inform the ICP which might raise an
> > > + * exception */
> > > +xive_icp_notify(xicp);
> > >  }
> > >  
> > >  /*
> > > @@ -339,6 +358,8 @@ static void xive_icp_irq(XiveICSState *xs, int lisn)
> > >  } else {
> > >  qemu_log_mask(LOG_UNIMP, "XIVE: w7 format1 not implemented\n");
> > >  }
> > > +
> > > +xive_icp_notify(xicp);
> > >  }
> > >  
> > >  /*



Re: [Qemu-devel] [RFC PATCH 16/26] ppc/xive: notify CPU when interrupt priority is more privileged

2017-09-09 Thread Benjamin Herrenschmidt
On Wed, 2017-07-05 at 19:13 +0200, Cédric Le Goater wrote:
> Signed-off-by: Cédric Le Goater 
> ---
>  hw/intc/xive.c | 21 +
>  1 file changed, 21 insertions(+)
> 
> diff --git a/hw/intc/xive.c b/hw/intc/xive.c
> index c3c1e9c9db2d..cda1fa18e44d 100644
> --- a/hw/intc/xive.c
> +++ b/hw/intc/xive.c
> @@ -53,6 +53,21 @@ static uint64_t xive_icp_accept(XiveICPState *xicp)
>  return (nsr << 8) | xicp->tima_os[TM_CPPR];
>  }
>  
> +static uint8_t ipb_to_pipr(uint8_t ibp)
> +{
> +return ibp ? clz32((uint32_t)ibp << 24) : 0xff;
> +}

The PIPR needs to be updated also on accept etc... anything that change
IPBs or CPPR really.

Also I just learned something from the designers: The IPIR is clamped
to CPPR.

So basically the value in the PIPR is:

v = leftmost_bit_of(ipb) (or 0xff);
pipr = v < cppr ? v : cppr;

which means it's never actually 0xff ... surprise !

That also means I need to fix my implementation of H_IPOLL in KVM.

> +static void xive_icp_notify(XiveICPState *xicp)
> +{
> +xicp->tima_os[TM_PIPR] = ipb_to_pipr(xicp->tima_os[TM_IPB]);
> +
> +if (xicp->tima_os[TM_PIPR] < xicp->tima_os[TM_CPPR]) {
> +xicp->tima_os[TM_NSR] |= TM_QW1_NSR_EO;
> +qemu_irq_raise(ICP(xicp)->output);
> +}
> +}
> +
>  static void xive_icp_set_cppr(XiveICPState *xicp, uint8_t cppr)
>  {
>  if (cppr > XIVE_PRIORITY_MAX) {
> @@ -60,6 +75,10 @@ static void xive_icp_set_cppr(XiveICPState *xicp, uint8_t 
> cppr)
>  }
>  
>  xicp->tima_os[TM_CPPR] = cppr;
> +
> +/* CPPR has changed, inform the ICP which might raise an
> + * exception */
> +xive_icp_notify(xicp);
>  }
>  
>  /*
> @@ -339,6 +358,8 @@ static void xive_icp_irq(XiveICSState *xs, int lisn)
>  } else {
>  qemu_log_mask(LOG_UNIMP, "XIVE: w7 format1 not implemented\n");
>  }
> +
> +xive_icp_notify(xicp);
>  }
>  
>  /*



Re: [Qemu-devel] [RFC PATCH 08/26] ppc/xive: add flags to the XIVE interrupt source

2017-07-25 Thread Benjamin Herrenschmidt
On Tue, 2017-07-25 at 14:18 +1000, David Gibson wrote:
> > Well, at this point I think nothing will set that flag It's there
> > for workaround around HW bugs on some chips. At least in full emu it
> > shouldn't happen unless we try to emulate those bugs. Hopefully direct
> > MMIO will just work.
> 
> Hm.  That doesn't seem like a good match for a per-irq state
> structure.

The flag is returned to the guest on a per-IRQ basis, so are the LSI
etc... flags, but at the HW level, indeed, they correspond to
attributes of blocks of interrupts.

It might be easier in qemu to keep that in the per-source flags
though.

Especially when we start having actual HW interrupts under the
hood with KVM. it's easier to keep the state self contained
for each of them.

Ben.




Re: [Qemu-devel] [RFC PATCH 08/26] ppc/xive: add flags to the XIVE interrupt source

2017-07-25 Thread Benjamin Herrenschmidt
On Tue, 2017-07-25 at 14:19 +1000, David Gibson wrote:
> > Nevertheless I have added support for the hcall in Linux and QEMU.
> > To use, I think we could create a specific source.
> 
> So, IIUC, it's host constraints that would make this one way or the
> other.  So what happens when a guest migrates from a host which has it
> one way to one which has it the other way?

It's probably ok to always call the hcall for the awy set -> unset, the
other way around is a problem, but it will end up depending on the kind
of interrupts we pass through.

Ben.



Re: [Qemu-devel] [RFC PATCH 08/26] ppc/xive: add flags to the XIVE interrupt source

2017-07-24 Thread Benjamin Herrenschmidt
On Mon, 2017-07-24 at 19:50 +1000, David Gibson wrote:
> On Mon, Jul 24, 2017 at 05:00:57PM +1000, Benjamin Herrenschmidt wrote:
> > On Mon, 2017-07-24 at 14:36 +1000, David Gibson wrote:
> > > On Wed, Jul 05, 2017 at 07:13:21PM +0200, Cédric Le Goater wrote:
> > > > These flags define some characteristics of the source :
> > > > 
> > > >  - XIVE_SRC_H_INT_ESB  the Event State Buffer are controlled with a
> > > >specific hcall H_INT_ESB
> > > 
> > > What's the other option?
> > 
> > Direct MMIO access. Normally all interrupts use normal MMIOs,
> > each interrupts has an associated MMIO page with special MMIOs
> > to control the source state (PQ bits). This is something I added
> > to the PAPR spec (and the OPAL <-> Linux interface) to allow firmware
> > to work around broken HW (which happens on some P9 versions).
> 
> Ok.. and that's something that can be decided at runtime?

Well, at this point I think nothing will set that flag It's there
for workaround around HW bugs on some chips. At least in full emu it
shouldn't happen unless we try to emulate those bugs. Hopefully direct
MMIO will just work.

Ben.




Re: [Qemu-devel] [RFC PATCH 07/26] ppc/xive: add MMIO handlers to the XIVE interrupt source

2017-07-24 Thread Benjamin Herrenschmidt
On Mon, 2017-07-24 at 14:29 +1000, David Gibson wrote:
> > +case XIVE_ESB_SET_PQ_00:
> > +case XIVE_ESB_SET_PQ_01:
> > +case XIVE_ESB_SET_PQ_10:
> > +case XIVE_ESB_SET_PQ_11:
> > +ret = xive_pq_get(x, lisn);
> > +xive_pq_set(x, lisn, (offset >> 8) & 0x3);
> 
> Again I'd prefer xive_pq_set() return the old value itself, for more
> obvious atomicity.

Agreed.  That will also help with StoreEOI (store to 0x400 of the EOI
page) which does an EOI then re-sends an interrupt if the old value was
11 (while the load EOI doesn't resend).

Cheers,
Ben.




Re: [Qemu-devel] [RFC PATCH 04/26] ppc/xive: introduce a skeleton for the XIVE interrupt controller model

2017-07-24 Thread Benjamin Herrenschmidt
On Mon, 2017-07-24 at 15:38 +1000, David Gibson wrote:
> 
> Can we assign our logical numbers sparsely, or will that cause other
> problems?

The main issue is that they probably needs to be the same between XICS
and XIVE because by the time we get the CAS call to chose between XICS
and XIVE, we have already handed out interrupts and constructed the DT,
no ? Unless we do a real CAS reboot...

Otherwise, there's no reason they can't be sparse no.

> Note that for PAPR we also have the question of finding logical
> interrupts for legacy PAPR VIO devices.

We just make them another range ? With KVM legacy today, I just use the
generic interrupt facility for those. So when you do the ioctl to
"trigger" one, I just do an MMIO to the corresponding page and the
interrupt magically shows up wherever the guest is running the target
vcpu. In fact, I'd like to add a way to mmap that page into qemu so
that qemu can triggers them without an ioctl.

The guest doesn't care, from the guest perspective they are interrupts
coming from the DT, so they are like PCI etc...

> > We can fix the number of "generic" interrupts given to a guest. The
> > only requirements from a PAPR perspective is that there should be at
> > least as many as there are possible threads in the guest so they can be
> > used as IPIs.
> 
> Ok.  If we can do things sparsely, allocating these well away from the
> hw interrupts would make things easier.
> 
> > But we may need more for other things. We can make this a machine
> > parameter with a default value of something like 4096. If we call N
> > that number of extra generic interrupts, then the number of generic
> > interrutps would be #possible-vcpu's + N, or something like that.
> 
> That seems reasonable.
> 
> > > > But it's fundamentally an allocator that sits in the hypervisor, so in
> > > > our case, I would say in the spapr "component" of XIVE, rather than the
> > > > XIVE HW model itself.
> > > 
> > > Maybe..
> > 
> > You are right in that a mapping is a better term than an allocator
> > here.
> > 
> > > > Now what Cedric did, because XIVE is very complex and we need something
> > > > for PAPR quickly, is not a complete HW model, but a somewhat simplified
> > > > one that only handles what PAPR exposes. So in that case where the
> > > > allocator sits is a bit of a TBD...
> > > 
> > > Hm, ok.  My concern here is that "dynamic" allocation of irqs at the
> > > machine type level needs extreme caution, or the irqs may not be
> > > stable which will generally break migration.
> > 
> > Yes you are right. We should probably create a more "static" scheme.
> 
> Sounds like we're in violent agreement.

Yup :)

Cheers,
Ben.




Re: [Qemu-devel] [RFC PATCH 08/26] ppc/xive: add flags to the XIVE interrupt source

2017-07-24 Thread Benjamin Herrenschmidt
On Mon, 2017-07-24 at 14:36 +1000, David Gibson wrote:
> On Wed, Jul 05, 2017 at 07:13:21PM +0200, Cédric Le Goater wrote:
> > These flags define some characteristics of the source :
> > 
> >  - XIVE_SRC_H_INT_ESB  the Event State Buffer are controlled with a
> >specific hcall H_INT_ESB
> 
> What's the other option?

Direct MMIO access. Normally all interrupts use normal MMIOs,
each interrupts has an associated MMIO page with special MMIOs
to control the source state (PQ bits). This is something I added
to the PAPR spec (and the OPAL <-> Linux interface) to allow firmware
to work around broken HW (which happens on some P9 versions).

> >  - XIVE_SRC_LSILSI or MSI source
> 
> Hrm.  This definitely duplicates info that is in the XICS per irq
> state which you're re-using (and which you're using in the xive code
> at this point).

I think all those flags correspond to the flags passed via the PAPR
API, so it makes sense to have them there.

> >  - XIVE_SRC_TRIGGERthe full function page supports trigger
> >  - XIVE_SRC_STORE_EOI  EOI can with a store.
> > 
> > Signed-off-by: Cédric Le Goater 
> > ---
> >  hw/intc/xive.c| 1 +
> >  include/hw/ppc/xive.h | 9 +
> >  2 files changed, 10 insertions(+)
> > 
> > diff --git a/hw/intc/xive.c b/hw/intc/xive.c
> > index 816031b8ac81..8f8bb8b787bd 100644
> > --- a/hw/intc/xive.c
> > +++ b/hw/intc/xive.c
> > @@ -345,6 +345,7 @@ static Property xive_ics_properties[] = {
> >  DEFINE_PROP_UINT32("nr-irqs", ICSState, nr_irqs, 0),
> >  DEFINE_PROP_UINT32("irq-base", ICSState, offset, 0),
> >  DEFINE_PROP_UINT32("shift", XiveICSState, esb_shift, 0),
> > +DEFINE_PROP_UINT64("flags", XiveICSState, flags, 0),
> >  DEFINE_PROP_END_OF_LIST(),
> >  };
> >  
> > diff --git a/include/hw/ppc/xive.h b/include/hw/ppc/xive.h
> > index 5303d96f5f59..1178300c9df3 100644
> > --- a/include/hw/ppc/xive.h
> > +++ b/include/hw/ppc/xive.h
> > @@ -30,9 +30,18 @@ typedef struct XiveICSState XiveICSState;
> >  #define TYPE_ICS_XIVE "xive-source"
> >  #define ICS_XIVE(obj) OBJECT_CHECK(XiveICSState, (obj), TYPE_ICS_XIVE)
> >  
> > +/*
> > + * XIVE Interrupt source flags
> > + */
> > +#define XIVE_SRC_H_INT_ESB (1ull << (63 - 60))
> > +#define XIVE_SRC_LSI   (1ull << (63 - 61))
> > +#define XIVE_SRC_TRIGGER   (1ull << (63 - 62))
> > +#define XIVE_SRC_STORE_EOI (1ull << (63 - 63))
> > +
> >  struct XiveICSState {
> >  ICSState parent_obj;
> >  
> > +uint64_t flags;
> >  uint32_t esb_shift;
> >  MemoryRegion esb_iomem;
> >  
> 
> 



Re: [Qemu-devel] [RFC PATCH 09/26] ppc/xive: add an overall memory region for the ESBs

2017-07-24 Thread Benjamin Herrenschmidt
On Mon, 2017-07-24 at 14:49 +1000, David Gibson wrote:
> On Wed, Jul 05, 2017 at 07:13:22PM +0200, Cédric Le Goater wrote:
> > Each source adds its own ESB mempry region to the overall ESB memory
> > region of the controller. It will be mapped in the CPU address space
> > when XIVE is activated.
> > 
> > The default mapping address for the ESB memory region is the same one
> > used on baremetal.
> > 
> > Signed-off-by: Cédric Le Goater 
> > ---
> >  hw/intc/xive-internal.h |  5 +
> >  hw/intc/xive.c  | 44 +++-
> >  2 files changed, 48 insertions(+), 1 deletion(-)
> > 
> > diff --git a/hw/intc/xive-internal.h b/hw/intc/xive-internal.h
> > index 8e755aa88a14..c06be823aad0 100644
> > --- a/hw/intc/xive-internal.h
> > +++ b/hw/intc/xive-internal.h
> > @@ -98,6 +98,7 @@ struct XIVE {
> >  SysBusDevice parent;
> >  
> >  /* Properties */
> > +uint32_t chip_id;
> 
> So there is a XIVE object per chip.  How does this work on PAPR?  One
> logical chip/XIVE, or something more complex?

One global XIVE for PAPR. For the MMIOs, the way it works is that:

 - For MMIOs pertaining to a specific interrupt or queue, there's an H-
call that will return the proper "guest physical" address. For qemu
with KVM we'll have to probably create a single chunk of qemu address
space (a single mem region) that contains individual pages mapped with
MAP_FIXED originating from the different HW bits, we still need to sort
out how exactly we'll do that in practice.

 - For the TIMA (the presentation MMIOs), those are always at the same
physical address for everybody (so for a guest it's a single memory
region we'll map to that physical address), the HW "knows" which HW
thread is talking to it (and the hypervisor tells the HW which vcpu is
running on a given HW thread at a given point in time). That address is
obtained from the device-tree

> >  uint32_t nr_targets;
> >  
> >  /* IRQ number allocator */
> > @@ -111,6 +112,10 @@ struct XIVE {
> >  void *sbe;
> >  XiveIVE  *ivt;
> >  XiveEQ   *eqdt;
> > +
> > +/* ESB and TIMA memory location */
> > +hwaddr   vc_base;
> > +MemoryRegion esb_iomem;
> >  };
> >  
> >  void xive_reset(void *dev);
> > diff --git a/hw/intc/xive.c b/hw/intc/xive.c
> > index 8f8bb8b787bd..a1cb87a07b76 100644
> > --- a/hw/intc/xive.c
> > +++ b/hw/intc/xive.c
> > @@ -312,6 +312,7 @@ static void xive_ics_realize(ICSState *ics, Error 
> > **errp)
> >  XiveICSState *xs = ICS_XIVE(ics);
> >  Object *obj;
> >  Error *err = NULL;
> > +XIVE *x;
> 
> I don't really like just 'x' for a context variable like this (as
> opposed to a temporary).
> 
> >  
> >  obj = object_property_get_link(OBJECT(xs), "xive", );
> >  if (!obj) {
> > @@ -319,7 +320,7 @@ static void xive_ics_realize(ICSState *ics, Error 
> > **errp)
> > __func__, error_get_pretty(err));
> >  return;
> >  }
> > -xs->xive = XIVE(obj);
> > +x = xs->xive = XIVE(obj);
> >  
> >  if (!ics->nr_irqs) {
> >  error_setg(errp, "Number of interrupts needs to be greater 0");
> > @@ -338,6 +339,11 @@ static void xive_ics_realize(ICSState *ics, Error 
> > **errp)
> >"xive.esb",
> >(1ull << xs->esb_shift) * ICS_BASE(xs)->nr_irqs);
> >  
> > +/* Install the ESB memory region in the overall one */
> > +memory_region_add_subregion(>esb_iomem,
> > +ICS_BASE(xs)->offset * (1 << 
> > xs->esb_shift),
> > +>esb_iomem);
> > +
> >  qemu_register_reset(xive_ics_reset, xs);
> >  }
> >  
> > @@ -375,6 +381,32 @@ static const TypeInfo xive_ics_info = {
> >   */
> >  #define MAX_HW_IRQS_ENTRIES (8 * 1024)
> >  
> > +/* VC BAR contains set translations for the ESBs and the EQs. */
> > +#define VC_BAR_DEFAULT   0x100ull
> > +#define VC_BAR_SIZE  0x080ull
> > +
> > +#define P9_MMIO_BASE 0x006ull
> > +#define P9_CHIP_BASE(id) (P9_MMIO_BASE | (0x400ull * (uint64_t) 
> > (id)))
> 
> chip-based MMIO addresses leaking into the PAPR model seems like it
> might not be what you want
> 
> > +static uint64_t xive_esb_default_read(void *p, hwaddr offset, unsigned 
> > size)
> > +{
> > +qemu_log_mask(LOG_UNIMP, "%s: 0x%" HWADDR_PRIx " [%u]\n",
> > +  __func__, offset, size);
> > +return 0;
> > +}
> > +
> > +static void xive_esb_default_write(void *opaque, hwaddr offset, uint64_t 
> > value,
> > +unsigned size)
> > +{
> > +qemu_log_mask(LOG_UNIMP, "%s: 0x%" HWADDR_PRIx " <- 0x%" PRIx64 " 
> > [%u]\n",
> > +  __func__, offset, value, size);
> > +}
> > +
> > +static const MemoryRegionOps xive_esb_default_ops = {
> > +.read = xive_esb_default_read,
> > +.write = xive_esb_default_write,
> > +.endianness = DEVICE_BIG_ENDIAN,
> > +};
> >  
> >  void xive_reset(void 

Re: [Qemu-devel] [RFC PATCH 04/26] ppc/xive: introduce a skeleton for the XIVE interrupt controller model

2017-07-23 Thread Benjamin Herrenschmidt
On Mon, 2017-07-24 at 13:28 +1000, David Gibson wrote:
> > So yes, in PAPR there's an "allocator" because the hypervisor will
> > create a guest "virtual" (or logical to use PAPR terminology) interrupt
> > number space, in order to represents the various interrupts into the
> > guest.
> 
> Ok, but are each of those logical irqs bound to a specific device/PHB
> line/whatever, or can they be configured by the guest?

So for clarity, let's first establish the terminology :-)

 - HW number is a HW interrupt number on a "bare metal" system or
powernv guest. For now we will ignore those, they are effectively a
side effect of how skiboot configure the XIVE and qemu per-se doesn't
allocate them.

 - A logical number is a "guest physical" interrupt number for a PAPR
guest. These fall into roughly 2 categories at the moment:

* "interrupts" (or related) properties in the DT, typically
interrupts for a PCI device, ranges of MSIs etc... that correspond to
HW sources from a PHB.

* "generic IPIs". Those are ranges of "generic" interrupts that the
hypervisor gives the guest. On a real system, they correspond to chunks
allocated off a HW facility for generic interrupts. Generic interrupts
are the same as normal interrupts from the prespective of
managing/receiving them, but are "triggered" by an MMIO to a certain HW
page. There's a DT property telling the guest the interrupt number
ranges for these guys.

So that logical number above is what a PAPR guest obtains from the DT
and uses for the various H-call used to manage and configure interrupt
sources.

In addition, the XIVE supports renumbering the interrupt number that
you obtain in the queues. Both bare metal linux, KVM and guests make
use of this. This only changes the number you observe in a queue when
you receive an interrupt, it has no effect on the HW number or logical
number used for the various management calls.

This is used by Linux so that:

  - On bare metal systems or PAPR guest with "exploitation mode" (ie,
PAPR guest directly using the XIVE), we put the linux interrupt number
in there as to avoid the reverse-mapping done by linux otherwise when
receiving an interrupt.

  - On PARP guests using the legacy hcalls, KVM configures the logical
number there.

> > Those numbers however are just tokens, they don't have to represent any
> > real HW concept. So they can be "allocated" in a rather fixed way, for
> > example, you could have something like a fixed map where you put all
> > the PCI interrupts at a certain number (a factor of the PHB# with room
> > or a fix number per PHB, maybe 16K or so, the HW does 4K max). Another
> > based would have a chunk of "general purpose" IPIs (for use for actual
> > IPIs and for other things to come). And a range for the virtual device
> > interrupts for example. Or you can just use an allocator.
> 
> Hm.  So what I'm meaning by an "allocator" is something at least
> partially dynamic.  Something you say "give me an irq" and it gives
> you the next available or similar.  As opposed to any mapping from
> devices to (logical) irqs, which the machine will need to supply one
> way or another.

For the sake of repeatability/migration etc... I think a mapping is
better than an allocator.  IE, a fixed number scheme so that the range
of interrupts for PHB#x is always a fixed function of x.

We can fix the number of "generic" interrupts given to a guest. The
only requirements from a PAPR perspective is that there should be at
least as many as there are possible threads in the guest so they can be
used as IPIs.

But we may need more for other things. We can make this a machine
parameter with a default value of something like 4096. If we call N
that number of extra generic interrupts, then the number of generic
interrutps would be #possible-vcpu's + N, or something like that.

> > But it's fundamentally an allocator that sits in the hypervisor, so in
> > our case, I would say in the spapr "component" of XIVE, rather than the
> > XIVE HW model itself.
> 
> Maybe..

You are right in that a mapping is a better term than an allocator
here.

> > Now what Cedric did, because XIVE is very complex and we need something
> > for PAPR quickly, is not a complete HW model, but a somewhat simplified
> > one that only handles what PAPR exposes. So in that case where the
> > allocator sits is a bit of a TBD...
> 
> Hm, ok.  My concern here is that "dynamic" allocation of irqs at the
> machine type level needs extreme caution, or the irqs may not be
> stable which will generally break migration.

Yes you are right. We should probably create a more "static" scheme.

Cheers,
Ben.




Re: [Qemu-devel] [RFC PATCH 04/26] ppc/xive: introduce a skeleton for the XIVE interrupt controller model

2017-07-21 Thread Benjamin Herrenschmidt
On Fri, 2017-07-21 at 17:50 +1000, David Gibson wrote:
> On Wed, Jul 19, 2017 at 02:02:18PM +1000, Benjamin Herrenschmidt wrote:
> > On Wed, 2017-07-19 at 13:08 +1000, David Gibson wrote:
> > > 
> > > I'm somewhat uncomfortable with an irq allocater here in the intc
> > > code.  As a rule, irq allocation is the responsibility of the machine,
> > > not any sub-component.  Furthermore, it should allocate in a way which
> > > is repeatable, since they need to stay stable across reboots and
> > > migrations.
> > > 
> > > And, yes, we have an allocator of sorts in XICS - it has caused a
> > > number of problems in the past.
> > 
> > So
> > 
> > For a bare metal model (which we don't have yet) of XIVE, the IRQ
> > numbering is entirely an artifact of how the HW is configured. There
> > should thus be no interrupt numbers visible to qemu.
> 
> Uh.. I don't entirely follow.  Do you mean that during boot the guest
> programs the irq numbers into the various components?

I said a "bare metal model" but yes. Pretty much. 
> 
> In that case this allocator stuff definitely doesn't belong on the
> xive code.
> 
> > For a PAPR model things are a bit different, but if we want to
> > maximize code re-use between the two, we probably need to make sure
> > the interrupts "allocated" by the machine for XIVE can be represented
> > by the HW model.
> > 
> > That means:
> > 
> >  - Each chip has a range (high bits are the block ID, which maps to a
> > chip, low bits, around 512K to 1M interrupts is the per-chip space).
> > 
> >  - Interrupts 0...N of that range (N depends on how much backing
> > memory and MMIO space is provisioned for each chip) are "generic IPIs"
> > which are somewhat generic interrupt source that can be triggered with
> > an MMIO store and routed to any target. Those are used in PAPR for
> > things like IPIs and some type of accelerator interrupts.
> > 
> >  - Portions of that range (which may or may not overlap the 0...N
> > above, if they do they "shadow" the generic interrupts) can be
> > configured to be the HW sources from the various PCIe bridges and
> > the PSI controller.
> 
> Err.. I'm confused how this not sure this relates to spapr.  There are
> no chips or PSI there, and the PCI bridges aren't really the same
> thing.

The above is the HW model, sorry for the confusion. With a few comments
about how they are used in PAPR.

So yes, in PAPR there's an "allocator" because the hypervisor will
create a guest "virtual" (or logical to use PAPR terminology) interrupt
number space, in order to represents the various interrupts into the
guest.

Those numbers however are just tokens, they don't have to represent any
real HW concept. So they can be "allocated" in a rather fixed way, for
example, you could have something like a fixed map where you put all
the PCI interrupts at a certain number (a factor of the PHB# with room
or a fix number per PHB, maybe 16K or so, the HW does 4K max). Another
based would have a chunk of "general purpose" IPIs (for use for actual
IPIs and for other things to come). And a range for the virtual device
interrupts for example. Or you can just use an allocator.

But it's fundamentally an allocator that sits in the hypervisor, so in
our case, I would say in the spapr "component" of XIVE, rather than the
XIVE HW model itself.

Now what Cedric did, because XIVE is very complex and we need something
for PAPR quickly, is not a complete HW model, but a somewhat simplified
one that only handles what PAPR exposes. So in that case where the
allocator sits is a bit of a TBD...

Cheers,
Ben.



Re: [Qemu-devel] [RFC PATCH 04/26] ppc/xive: introduce a skeleton for the XIVE interrupt controller model

2017-07-18 Thread Benjamin Herrenschmidt
On Wed, 2017-07-19 at 14:01 +1000, David Gibson wrote:
> On Wed, Jul 19, 2017 at 01:56:57PM +1000, Benjamin Herrenschmidt wrote:
> > On Wed, 2017-07-19 at 13:08 +1000, David Gibson wrote:
> > > On Wed, Jul 05, 2017 at 07:13:17PM +0200, Cédric Le Goater wrote:
> > > > Let's provide an empty shell for the XIVE controller model with a
> > > > couple of attributes for the IRQ number allocator. The latter is
> > > > largely inspired by OPAL which allocates IPI IRQ numbers from the
> > > > bottom of the IRQ number space and allocates the HW IRQ numbers from
> > > > the top.
> > > > 
> > > > The number of IPIs is simply deduced from the max number of CPUs the
> > > > guest supports and we provision a arbitrary number of HW irqs.
> > > > 
> > > > The XIVE object is kept private because it will hold internal tables
> > > > which do not need to be exposed to sPAPR.
> > 
> > It does have an MMIO presence though... more than one even. There's the
> > TIMA (per-HW thread control area) and there's the per-interrupt MMIO
> > space which are exposed to the guest. There's also the per-queue
> > MMIO control area too.
> 
> Ok.  Always?  Or just on powernv?
> 
> If it only has an MMIO presence on powernv, then the "core" xive
> object should probably be TYPE_DEVICE, with the powernv specific
> device being a SysBusDevice which incorporates the core xive device
> inside it.

No the ones above are on PAPR. PowerNV has even more :-)

The TIMA (thread management area) is the MMIO area through which
you control the current CPU priority etc...

It's designed in HW to "know" which core/thread is accessing it (it's
at a fixed address) and respond appropriately based on that and which
virtual CPU has been activated on that core/thread.

It's part of what allows XIVE to deliver interrupts without any HV
calls.

Cheers,
Ben.



Re: [Qemu-devel] [RFC PATCH 04/26] ppc/xive: introduce a skeleton for the XIVE interrupt controller model

2017-07-18 Thread Benjamin Herrenschmidt
On Wed, 2017-07-19 at 13:08 +1000, David Gibson wrote:
> 
> I'm somewhat uncomfortable with an irq allocater here in the intc
> code.  As a rule, irq allocation is the responsibility of the machine,
> not any sub-component.  Furthermore, it should allocate in a way which
> is repeatable, since they need to stay stable across reboots and
> migrations.
> 
> And, yes, we have an allocator of sorts in XICS - it has caused a
> number of problems in the past.

So

For a bare metal model (which we don't have yet) of XIVE, the IRQ
numbering is entirely an artifact of how the HW is configured. There
should thus be no interrupt numbers visible to qemu.

For a PAPR model things are a bit different, but if we want to
maximize code re-use between the two, we probably need to make sure
the interrupts "allocated" by the machine for XIVE can be represented
by the HW model.

That means:

 - Each chip has a range (high bits are the block ID, which maps to a
chip, low bits, around 512K to 1M interrupts is the per-chip space).

 - Interrupts 0...N of that range (N depends on how much backing
memory and MMIO space is provisioned for each chip) are "generic IPIs"
which are somewhat generic interrupt source that can be triggered with
an MMIO store and routed to any target. Those are used in PAPR for
things like IPIs and some type of accelerator interrupts.

 - Portions of that range (which may or may not overlap the 0...N
above, if they do they "shadow" the generic interrupts) can be
configured to be the HW sources from the various PCIe bridges and
the PSI controller.

Cheers,
Ben.

> > +}
> > +
> > +static Property xive_properties[] = {
> > +DEFINE_PROP_UINT32("nr-targets", XIVE, nr_targets, 0),
> > +DEFINE_PROP_END_OF_LIST(),
> > +};
> > +
> > +static void xive_class_init(ObjectClass *klass, void *data)
> > +{
> > +DeviceClass *dc = DEVICE_CLASS(klass);
> > +
> > +dc->realize = xive_realize;
> > +dc->props = xive_properties;
> > +dc->desc = "XIVE";
> > +}
> > +
> > +static const TypeInfo xive_info = {
> > +.name = TYPE_XIVE,
> > +.parent = TYPE_SYS_BUS_DEVICE,
> > +.instance_init = xive_init,
> > +.instance_size = sizeof(XIVE),
> > +.class_init = xive_class_init,
> > +};
> > +
> > +static void xive_register_types(void)
> > +{
> > +type_register_static(_info);
> > +}
> > +
> > +type_init(xive_register_types)
> > diff --git a/include/hw/ppc/xive.h b/include/hw/ppc/xive.h
> > new file mode 100644
> > index ..863f5a9c6b5f
> > --- /dev/null
> > +++ b/include/hw/ppc/xive.h
> > @@ -0,0 +1,27 @@
> > +/*
> > + * QEMU PowerPC XIVE model
> > + *
> > + * Copyright (c) 2017, IBM Corporation.
> > + *
> > + * This program is free software; you can redistribute it and/or modify
> > + * it under the terms of the GNU General Public License, version 2, as
> > + * published by the Free Software Foundation.
> > + *
> > + * This program is distributed in the hope that it will be useful,
> > + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> > + * GNU General Public License for more details.
> > + *
> > + * You should have received a copy of the GNU General Public License
> > + * along with this program; if not, see .
> > + */
> > +
> > +#ifndef PPC_XIVE_H
> > +#define PPC_XIVE_H
> > +
> > +typedef struct XIVE XIVE;
> > +
> > +#define TYPE_XIVE "xive"
> > +#define XIVE(obj) OBJECT_CHECK(XIVE, (obj), TYPE_XIVE)
> > +
> > +#endif /* PPC_XIVE_H */
> 



Re: [Qemu-devel] [RFC PATCH 04/26] ppc/xive: introduce a skeleton for the XIVE interrupt controller model

2017-07-18 Thread Benjamin Herrenschmidt
On Wed, 2017-07-19 at 13:08 +1000, David Gibson wrote:
> On Wed, Jul 05, 2017 at 07:13:17PM +0200, Cédric Le Goater wrote:
> > Let's provide an empty shell for the XIVE controller model with a
> > couple of attributes for the IRQ number allocator. The latter is
> > largely inspired by OPAL which allocates IPI IRQ numbers from the
> > bottom of the IRQ number space and allocates the HW IRQ numbers from
> > the top.
> > 
> > The number of IPIs is simply deduced from the max number of CPUs the
> > guest supports and we provision a arbitrary number of HW irqs.
> > 
> > The XIVE object is kept private because it will hold internal tables
> > which do not need to be exposed to sPAPR.

It does have an MMIO presence though... more than one even. There's the
TIMA (per-HW thread control area) and there's the per-interrupt MMIO
space which are exposed to the guest. There's also the per-queue
MMIO control area too.

Ben.

> > Signed-off-by: Cédric Le Goater 
> > ---
> >  default-configs/ppc64-softmmu.mak |  1 +
> >  hw/intc/Makefile.objs |  1 +
> >  hw/intc/xive-internal.h   | 28 
> >  hw/intc/xive.c| 94 
> > +++
> >  include/hw/ppc/xive.h | 27 +++
> >  5 files changed, 151 insertions(+)
> >  create mode 100644 hw/intc/xive-internal.h
> >  create mode 100644 hw/intc/xive.c
> >  create mode 100644 include/hw/ppc/xive.h
> > 
> > diff --git a/default-configs/ppc64-softmmu.mak 
> > b/default-configs/ppc64-softmmu.mak
> > index 46c95993217d..1179c07e6e9f 100644
> > --- a/default-configs/ppc64-softmmu.mak
> > +++ b/default-configs/ppc64-softmmu.mak
> > @@ -56,6 +56,7 @@ CONFIG_SM501=y
> >  CONFIG_XICS=$(CONFIG_PSERIES)
> >  CONFIG_XICS_SPAPR=$(CONFIG_PSERIES)
> >  CONFIG_XICS_KVM=$(and $(CONFIG_PSERIES),$(CONFIG_KVM))
> > +CONFIG_XIVE=$(CONFIG_PSERIES)
> >  # For PReP
> >  CONFIG_SERIAL_ISA=y
> >  CONFIG_MC146818RTC=y
> > diff --git a/hw/intc/Makefile.objs b/hw/intc/Makefile.objs
> > index 78426a7dafcd..28b83456bfcc 100644
> > --- a/hw/intc/Makefile.objs
> > +++ b/hw/intc/Makefile.objs
> > @@ -35,6 +35,7 @@ obj-$(CONFIG_SH4) += sh_intc.o
> >  obj-$(CONFIG_XICS) += xics.o
> >  obj-$(CONFIG_XICS_SPAPR) += xics_spapr.o
> >  obj-$(CONFIG_XICS_KVM) += xics_kvm.o
> > +obj-$(CONFIG_XIVE) += xive.o
> >  obj-$(CONFIG_POWERNV) += xics_pnv.o
> >  obj-$(CONFIG_ALLWINNER_A10_PIC) += allwinner-a10-pic.o
> >  obj-$(CONFIG_S390_FLIC) += s390_flic.o
> > diff --git a/hw/intc/xive-internal.h b/hw/intc/xive-internal.h
> > new file mode 100644
> > index ..155c2dcd6066
> > --- /dev/null
> > +++ b/hw/intc/xive-internal.h
> > @@ -0,0 +1,28 @@
> > +/*
> > + * Copyright 2016,2017 IBM Corporation.
> > + *
> > + * This program is free software; you can redistribute it and/or
> > + * modify it under the terms of the GNU General Public License
> > + * as published by the Free Software Foundation; either version
> > + * 2 of the License, or (at your option) any later version.
> > + */
> > +#ifndef _INTC_XIVE_INTERNAL_H
> > +#define _INTC_XIVE_INTERNAL_H
> > +
> > +#include 
> > +
> > +struct XIVE {
> > +SysBusDevice parent;
> 
> XIVE probably shouldn't be a SysBusDevice.  According to agraf, that
> should only be used for things which have an MMIO presence on a bus
> structure that's not worth the bother of more specifically modelling.
> 
> I don't think that's the case for XIVE, so it should just have
> TYPE_DEVICE as its parent.  There are several pseries things which
> already get this wrong (mostly because I made them before fully
> understanding the role of the SysBus), but we should avoid adding
> others.
> 
> > +/* Properties */
> > +uint32_t nr_targets;
> > +
> > +/* IRQ number allocator */
> > +uint32_t int_count; /* Number of interrupts: nr_targets + HW 
> > IRQs */
> > +uint32_t int_base;  /* Min index */
> > +uint32_t int_max;   /* Max index */
> > +uint32_t int_hw_bot;/* Bottom index of HW IRQ allocator */
> > +uint32_t int_ipi_top;   /* Highest IPI index handed out so far + 1 
> > */
> > +};
> > +
> > +#endif /* _INTC_XIVE_INTERNAL_H */
> > diff --git a/hw/intc/xive.c b/hw/intc/xive.c
> > new file mode 100644
> > index ..5b4ea915d87c
> > --- /dev/null
> > +++ b/hw/intc/xive.c
> > @@ -0,0 +1,94 @@
> > +/*
> > + * QEMU PowerPC XIVE model
> > + *
> > + * Copyright (c) 2017, IBM Corporation.
> > + *
> > + * This program is free software; you can redistribute it and/or modify
> > + * it under the terms of the GNU General Public License, version 2, as
> > + * published by the Free Software Foundation.
> > + *
> > + * This program is distributed in the hope that it will be useful,
> > + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> > + * GNU General Public License for more details.
> > + *
> > + * You should have received a copy of the 

Re: [Qemu-devel] [RFC PATCH 00/26] guest exploitation of the XIVE interrupt controller (POWER9)

2017-07-18 Thread Benjamin Herrenschmidt
On Wed, 2017-07-19 at 13:00 +1000, David Gibson wrote:
> So, this is probably obvious, but I'm not considering this a candidate
> for qemu 2.10 (seeing as the soft freeze was yesterday).  I'll still
> try to review and, once ready, queue for 2.11.

Right. I need to review still and we need to make sure we have the
right plumbing for migration etc... and of course I need to do the
KVM bits. So it's definitely not 2.10 material.

Cheers,
Ben.




Re: [Qemu-devel] [PATCH v3] spapr: disable decrementer during reset

2017-07-18 Thread Benjamin Herrenschmidt
On Tue, 2017-07-18 at 10:56 +0530, Nikunj A Dadhania wrote:
> In case of reboot, all CPUs are resumed after reboot. So we check the
> next condition cpu_has_work() in cpu_thread_is_idle(), where we see a
> DECR interrupt and remove the CPU from halted state as the CPU has work.

Shouldn't we put them back in the stopped state then ?

Cheers,
Ben.




  1   2   3   4   5   6   7   8   9   10   >