Re: [PATCH v4 11/15] hw/nvme: Calculate BAR attributes in a function

2022-02-10 Thread Klaus Jensen
On Jan 26 18:11, Lukasz Maniak wrote:
> From: Łukasz Gieryk 
> 
> An NVMe device with SR-IOV capability calculates the BAR size
> differently for PF and VF, so it makes sense to extract the common code
> to a separate function.
> 
> Signed-off-by: Łukasz Gieryk 
> ---
>  hw/nvme/ctrl.c | 45 +++++++++++++++++++++++++++++++--------------
>  1 file changed, 31 insertions(+), 14 deletions(-)
> 
> diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
> index 40eb6bd1a8..e101cb7d7c 100644
> --- a/hw/nvme/ctrl.c
> +++ b/hw/nvme/ctrl.c
> @@ -6431,6 +6431,34 @@ static void nvme_init_pmr(NvmeCtrl *n, PCIDevice 
> *pci_dev)
>  memory_region_set_enabled(&n->pmr.dev->mr, false);
>  }
>  
> +static uint64_t nvme_bar_size(unsigned total_queues, unsigned total_irqs,
> +  unsigned *msix_table_offset,
> +  unsigned *msix_pba_offset)
> +{
> +uint64_t bar_size, msix_table_size, msix_pba_size;
> +
> +bar_size = sizeof(NvmeBar) + 2 * total_queues * NVME_DB_SIZE;
> +bar_size = QEMU_ALIGN_UP(bar_size, 4 * KiB);
> +
> +if (msix_table_offset) {
> +*msix_table_offset = bar_size;
> +}
> +
> +msix_table_size = PCI_MSIX_ENTRY_SIZE * total_irqs;
> +bar_size += msix_table_size;
> +bar_size = QEMU_ALIGN_UP(bar_size, 4 * KiB);
> +
> +if (msix_pba_offset) {
> +*msix_pba_offset = bar_size;
> +}
> +
> +msix_pba_size = QEMU_ALIGN_UP(total_irqs, 64) / 8;
> +bar_size += msix_pba_size;
> +
> +bar_size = pow2ceil(bar_size);
> +return bar_size;
> +}
> +
>  static void nvme_init_sriov(NvmeCtrl *n, PCIDevice *pci_dev, uint16_t offset,
>  uint64_t bar_size)
>  {
> @@ -6470,7 +6498,7 @@ static int nvme_add_pm_capability(PCIDevice *pci_dev, 
> uint8_t offset)
>  static int nvme_init_pci(NvmeCtrl *n, PCIDevice *pci_dev, Error **errp)
>  {
>  uint8_t *pci_conf = pci_dev->config;
> -uint64_t bar_size, msix_table_size, msix_pba_size;
> +uint64_t bar_size;
>  unsigned msix_table_offset, msix_pba_offset;
>  int ret;
>  
> @@ -6496,19 +6524,8 @@ static int nvme_init_pci(NvmeCtrl *n, PCIDevice 
> *pci_dev, Error **errp)
>  }
>  
>  /* add one to max_ioqpairs to account for the admin queue pair */
> -bar_size = sizeof(NvmeBar) +
> -   2 * (n->params.max_ioqpairs + 1) * NVME_DB_SIZE;
> -bar_size = QEMU_ALIGN_UP(bar_size, 4 * KiB);
> -msix_table_offset = bar_size;
> -msix_table_size = PCI_MSIX_ENTRY_SIZE * n->params.msix_qsize;
> -
> -bar_size += msix_table_size;
> -bar_size = QEMU_ALIGN_UP(bar_size, 4 * KiB);
> -msix_pba_offset = bar_size;
> -msix_pba_size = QEMU_ALIGN_UP(n->params.msix_qsize, 64) / 8;
> -
> -bar_size += msix_pba_size;
> -bar_size = pow2ceil(bar_size);
> +bar_size = nvme_bar_size(n->params.max_ioqpairs + 1, 
> n->params.msix_qsize,
> + &msix_table_offset, &msix_pba_offset);
>  
>  memory_region_init(&n->bar0, OBJECT(n), "nvme-bar0", bar_size);
>  memory_region_init_io(&n->iomem, OBJECT(n), &nvme_mmio_ops, n, "nvme",
> -- 
> 2.25.1
> 

Looks good,

Reviewed-by: Klaus Jensen 




Re: [PATCH v4 15/15] hw/nvme: Update the initialization place for the AER queue

2022-02-10 Thread Klaus Jensen
On Jan 26 18:11, Lukasz Maniak wrote:
> From: Łukasz Gieryk 
> 
> This patch updates the initialization place for the AER queue, so it’s
> initialized once, at controller initialization, and not every time
> the controller is enabled.
> 
> While the original version works for a non-SR-IOV device, as it’s hard
> to interact with the controller if it’s not enabled, the multiple
> reinitialization is not necessarily correct.
> 
> With the SR/IOV feature enabled a segfault can happen: a VF can have its
> controller disabled, while a namespace can still be attached to the
> controller through the parent PF. An event generated in such case ends
> up on an uninitialized queue.
> 
> While it’s an interesting question whether a VF should support AER in
> the first place, I don’t think it must be answered today.
> 
> Signed-off-by: Łukasz Gieryk 
> ---
>  hw/nvme/ctrl.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
> index 624db2f9c6..b2228e960f 100644
> --- a/hw/nvme/ctrl.c
> +++ b/hw/nvme/ctrl.c
> @@ -6029,8 +6029,6 @@ static int nvme_start_ctrl(NvmeCtrl *n)
>  
>  nvme_set_timestamp(n, 0ULL);
>  
> -QTAILQ_INIT(&n->aer_queue);
> -
>  nvme_select_iocs(n);
>  
>  return 0;
> @@ -7007,6 +7005,8 @@ static void nvme_init_ctrl(NvmeCtrl *n, PCIDevice 
> *pci_dev)
>  id->cmic |= NVME_CMIC_MULTI_CTRL;
>  }
>  
> +QTAILQ_INIT(&n->aer_queue);
> +
>  NVME_CAP_SET_MQES(cap, 0x7ff);
>  NVME_CAP_SET_CQR(cap, 1);
>  NVME_CAP_SET_TO(cap, 0xf);
> -- 
> 2.25.1
> 

Fix is good, but I think this belongs in nvme_init_state(). Otherwise,

Reviewed-by: Klaus Jensen 




Re: [PATCH v4 00/15] hw/nvme: SR-IOV with Virtualization Enhancements

2022-02-10 Thread Klaus Jensen
On Jan 26 18:11, Lukasz Maniak wrote:
> Changes since v3:
> - Addressed comments to review on pcie: Add support for Single Root I/O
>   Virtualization (SR/IOV)
> - Fixed issues reported by checkpatch.pl
> 
> Knut Omang (2):
>   pcie: Add support for Single Root I/O Virtualization (SR/IOV)
>   pcie: Add some SR/IOV API documentation in docs/pcie_sriov.txt
> 
> Lukasz Maniak (4):
>   hw/nvme: Add support for SR-IOV
>   hw/nvme: Add support for Primary Controller Capabilities
>   hw/nvme: Add support for Secondary Controller List
>   docs: Add documentation for SR-IOV and Virtualization Enhancements
> 
> Łukasz Gieryk (9):
>   pcie: Add a helper to the SR/IOV API
>   pcie: Add 1.2 version token for the Power Management Capability
>   hw/nvme: Implement the Function Level Reset
>   hw/nvme: Make max_ioqpairs and msix_qsize configurable in runtime
>   hw/nvme: Remove reg_size variable and update BAR0 size calculation
>   hw/nvme: Calculate BAR attributes in a function
>   hw/nvme: Initialize capability structures for primary/secondary
> controllers
>   hw/nvme: Add support for the Virtualization Management command
>   hw/nvme: Update the initialization place for the AER queue
> 
>  docs/pcie_sriov.txt  | 115 ++
>  docs/system/devices/nvme.rst |  36 ++
>  hw/nvme/ctrl.c   | 675 ---
>  hw/nvme/ns.c |   2 +-
>  hw/nvme/nvme.h   |  55 ++-
>  hw/nvme/subsys.c |  75 +++-
>  hw/nvme/trace-events |   6 +
>  hw/pci/meson.build   |   1 +
>  hw/pci/pci.c | 100 --
>  hw/pci/pcie.c|   5 +
>  hw/pci/pcie_sriov.c  | 302 
>  hw/pci/trace-events  |   5 +
>  include/block/nvme.h |  65 
>  include/hw/pci/pci.h |  12 +-
>  include/hw/pci/pci_ids.h |   1 +
>  include/hw/pci/pci_regs.h|   1 +
>  include/hw/pci/pcie.h|   6 +
>  include/hw/pci/pcie_sriov.h  |  77 
>  include/qemu/typedefs.h  |   2 +
>  19 files changed, 1460 insertions(+), 81 deletions(-)
>  create mode 100644 docs/pcie_sriov.txt
>  create mode 100644 hw/pci/pcie_sriov.c
>  create mode 100644 include/hw/pci/pcie_sriov.h
> 
> -- 
> 2.25.1
> 
> 

Hi Lukasz,

Back in v3 you changed this:

- Secondary controller cannot be set online unless the corresponding VF
  is enabled (sriov_numvfs set to at least the secondary controller's VF
  number)

I'm having issues getting this to work now. As I understand it, this now
requires that sriov_numvfs is set prior to onlining the devices, i.e.:

  echo 1 > /sys/bus/pci/devices/0000\:01\:00.0/sriov_numvfs

However, this causes the kernel to reject it:

  nvme nvme1: Device not ready; aborting initialisation, CSTS=0x2
  nvme nvme1: Removing after probe failure status: -19

Is this the expected behavior? Must I manually bind the device again to
the nvme driver? Prior to v3 this worked just fine since the VF was
onlined at this point.

It would be useful if you added a small "onlining for dummies" section
to the docs ;)




Re: [PATCH v3 18/37] target/ppc: implement vgnb

2022-02-10 Thread Richard Henderson

On 2/10/22 23:34, matheus.fe...@eldorado.org.br wrote:

+for (int dw = 1; dw >= 0; dw--) {
+get_avr64(vrb, a->vrb, dw);
+for (; in >= 0; in -= a->n, out--) {
+if (in > out) {
+tcg_gen_shri_i64(tmp, vrb, in - out);
+} else {
+tcg_gen_shli_i64(tmp, vrb, out - in);
+}
+tcg_gen_andi_i64(tmp, tmp, 1ULL << out);
+tcg_gen_or_i64(rt, rt, tmp);
+}
+in += 64;
+}


This is going to produce up to 3*64 operations (n=2).

You can produce more than one output pairing per shift,
and produce the same result in 3*lg2(64) operations.

I've given an example like this on the list before, recently.
I think it was in the context of some riscv bit manipulation.


N = 2

AxBxCxDxExFxGxHxIxJxKxLxMxNxOxPxQxRxSxTxUxVxWxXxYxZx0x1x2x3x4x5x
  & rep(0b10)
A.B.C.D.E.F.G.H.I.J.K.L.M.N.O.P.Q.R.S.T.U.V.W.X.Y.Z.0.1.2.3.4.5.
  << 1
.B.C.D.E.F.G.H.I.J.K.L.M.N.O.P.Q.R.S.T.U.V.W.X.Y.Z.0.1.2.3.4.5..
  |
ABBCCDDEEFFGGHHIIJJKKLLMMNNOOPPQQRRSSTTUUVVWWXXYYZZ001122334455.
  & rep(0b1100)
AB..CD..EF..GH..IJ..KL..MN..OP..QR..ST..UV..WX..YZ..01..23..45..
  << 2
..CD..EF..GH..IJ..KL..MN..OP..QR..ST..UV..WX..YZ..01..23..45....
  |
ABCDCDEFEFGHGHIJIJKLKLMNMNOPOPQRQRSTSTUVUVWXWXYZYZ010123234545..
  & rep(0xf0)
ABCD....EFGH....IJKL....MNOP....QRST....UVWX....YZ01....2345....
  << 4
....EFGH....IJKL....MNOP....QRST....UVWX....YZ01....2345........
  |
ABCDEFGHEFGHIJKLIJKLMNOPMNOPQRSTQRSTUVWXUVWXYZ01YZ0123452345....
  & rep(0xff00)
ABCDEFGH........IJKLMNOP........QRSTUVWX........YZ012345........
  << 8
........IJKLMNOP........QRSTUVWX........YZ012345................
  |
ABCDEFGHIJKLMNOPIJKLMNOPQRSTUVWXQRSTUVWXYZ012345YZ012345........
  & rep(0xffff0000)
ABCDEFGHIJKLMNOP................QRSTUVWXYZ012345................
  deposit(t, 32, 16)
ABCDEFGHIJKLMNOPQRSTUVWXYZ012345................................
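
In C, the n = 2 gather above is (roughly, untested):

    uint64_t t = x & 0xaaaaaaaaaaaaaaaaull;         /* rep(0b10)       */
    t |= t << 1;
    t &= 0xccccccccccccccccull;                     /* rep(0b1100)     */
    t |= t << 2;
    t &= 0xf0f0f0f0f0f0f0f0ull;                     /* rep(0xf0)       */
    t |= t << 4;
    t &= 0xff00ff00ff00ff00ull;                     /* rep(0xff00)     */
    t |= t << 8;
    t &= 0xffff0000ffff0000ull;                     /* rep(0xffff0000) */
    /* deposit(t, 32, 16): pull the second 16-bit group up against the
       first, leaving the 32 gathered bits in bits 63..32. */
    t = (t & 0xffff000000000000ull) | ((t << 16) & 0x0000ffff00000000ull);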


and similarly for larger N.  For N >= 4, I believe that half of the masking may be elided, 
because there are already zeros in which to place bits.



N = 5

AxxxxBxxxxCxxxxDxxxxExxxxFxxxxGxxxxHxxxxIxxxxJxxxxKxxxxLxxxxMxxx
  & rep(0b10000)
A....B....C....D....E....F....G....H....I....J....K....L....M...
  << (5 - 1)
.B....C....D....E....F....G....H....I....J....K....L....M.......
  |
AB...BC...CD...DE...EF...FG...GH...HI...IJ...JK...KL...LM...M...
  << (10 - 2)
..CD...DE...EF...FG...GH...HI...IJ...JK...KL...LM...M...........
  |
ABCD.BCDE.CDEF.DEFG.EFGH.FGHI.GHIJ.HIJK.IJKL.JKLM.KLM..LM...M...
  & rep(0xf0000)
ABCD................EFGH................IJKL................M...
  << (20 - 4)
....EFGH................IJKL................M...................
  |
ABCDEFGH............EFGHIJKL............IJKLM...............M...
  << (40 - 8)
........IJKLM...............M...................................
  |
ABCDEFGHIJKLM.......EFGHIJKLM...........IJKLM...............M...
  & 0xfff8_0000_0000_0000
ABCDEFGHIJKLM...................................................


It's probably worth working through the various N to make sure you know which masking is 
required.



r~



Re: [PATCH v3 14/37] target/ppc: implement vstri[bh][lr]

2022-02-10 Thread Richard Henderson

On 2/10/22 23:34, matheus.fe...@eldorado.org.br wrote:

+#define VSTRI(NAME, ELEM, NUM_ELEMS, LEFT) \
+void helper_##NAME(CPUPPCState *env, ppc_avr_t *t, ppc_avr_t *b,\
+   target_ulong rc) \
+{   \
+bool null_found = false;\
+int i, idx; \
+\
+for (i = 0; i < NUM_ELEMS; i++) {   \
+idx = LEFT ? i : NUM_ELEMS - i - 1; \
+if (b->Vsr##ELEM(idx)) {\
+t->Vsr##ELEM(idx) = b->Vsr##ELEM(idx);  \
+} else {\
+null_found = true;  \
+break;  \
+}   \
+}   \
+\
+for (; i < NUM_ELEMS; i++) {\
+idx = LEFT ? i : NUM_ELEMS - i - 1; \
+t->Vsr##ELEM(idx) = 0;  \
+}   \
+\
+if (rc) {   \
+env->crf[6] = null_found ? 0b0010 : 0;  \
+}   \
+}


The only reason you're passing in env is for crf[6], which requires you to pass in a 
second argument.  And you're not using the return value.


It would be better to always return the rc value, and only conditionally assign 
it.
E.g.

if (a->rc) {
gen_helper(cpu_crf[6], vrt, vrb);
} else {
TCGv_i32 discard = tcg_temp_new_i32();
gen_helper(discard, vrt, vrb);
tcg_temp_free_i32(discard);
}
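
with the helper returning the CR6 value, e.g. (untested):

    /* DEF_HELPER_FLAGS_2(VSTRIBL, TCG_CALL_NO_RWG, i32, avr, avr), etc. */
    uint32_t helper_##NAME(ppc_avr_t *t, ppc_avr_t *b)
    {
        /* ... same loops as before, without env and rc ... */
        return null_found ? 0b0010 : 0;
    }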

r~



Re: [PATCH v3 17/37] target/ppc: implement vcntmb[bhwd]

2022-02-10 Thread Richard Henderson

On 2/10/22 23:34, matheus.fe...@eldorado.org.br wrote:

From: Matheus Ferst

Signed-off-by: Matheus Ferst
---
  target/ppc/insn32.decode|  8 
  target/ppc/translate/vmx-impl.c.inc | 32 +
  2 files changed, 40 insertions(+)


Reviewed-by: Richard Henderson 

r~



Re: [PATCH v3 13/37] target/ppc: Implement Vector Compare Quadword

2022-02-10 Thread Richard Henderson

On 2/10/22 23:34, matheus.fe...@eldorado.org.br wrote:

From: Matheus Ferst

Implement the following PowerISA v3.1 instructions:
vcmpsq: Vector Compare Signed Quadword
vcmpuq: Vector Compare Unsigned Quadword

Signed-off-by: Matheus Ferst
---
  target/ppc/insn32.decode|  6 
  target/ppc/translate/vmx-impl.c.inc | 45 +
  2 files changed, 51 insertions(+)


This one is complex enough it does warrant branches.

Reviewed-by: Richard Henderson 


r~



Re: [PATCH v3 15/37] target/ppc: implement vclrlb

2022-02-10 Thread Richard Henderson

On 2/10/22 23:34, matheus.fe...@eldorado.org.br wrote:

From: Matheus Ferst 

Signed-off-by: Matheus Ferst 
---
  target/ppc/insn32.decode|  2 ++
  target/ppc/translate/vmx-impl.c.inc | 56 +
  2 files changed, 58 insertions(+)

diff --git a/target/ppc/insn32.decode b/target/ppc/insn32.decode
index ea497ecd80..483651cf9c 100644
--- a/target/ppc/insn32.decode
+++ b/target/ppc/insn32.decode
VSTRIBR 000100 ..... 00001 ..... . 0000001101   @VX_tb_rc
  VSTRIHL 000100 ..... 00010 ..... . 0000001101   @VX_tb_rc
  VSTRIHR 000100 ..... 00011 ..... . 0000001101   @VX_tb_rc
  
+VCLRLB  000100 ..... ..... ..... 00110001101    @VX
+
  # VSX Load/Store Instructions
  
  LXV     111101 ..... ..... ............ . 001   @DQ_TSX

diff --git a/target/ppc/translate/vmx-impl.c.inc 
b/target/ppc/translate/vmx-impl.c.inc
index 8bcf637ff8..3fb4935bff 100644
--- a/target/ppc/translate/vmx-impl.c.inc
+++ b/target/ppc/translate/vmx-impl.c.inc
@@ -1956,6 +1956,62 @@ TRANS(VSTRIBR, do_vstri, gen_helper_VSTRIBR)
  TRANS(VSTRIHL, do_vstri, gen_helper_VSTRIHL)
  TRANS(VSTRIHR, do_vstri, gen_helper_VSTRIHR)
  
+static bool trans_VCLRLB(DisasContext *ctx, arg_VX *a)

+{
+TCGv_i64 hi, lo, rb;
+TCGLabel *l, *end;
+
+REQUIRE_INSNS_FLAGS2(ctx, ISA310);
+REQUIRE_VECTOR(ctx);
+
+l = gen_new_label();
+end = gen_new_label();
+
+hi = tcg_const_local_i64(0);
+lo = tcg_const_local_i64(0);
+rb = tcg_temp_local_new_i64();
+
+tcg_gen_extu_tl_i64(rb, cpu_gpr[a->vrb]);
+
+/* RB == 0: all zeros */
+tcg_gen_brcondi_i64(TCG_COND_EQ, rb, 0, end);
+
+get_avr64(lo, a->vra, false);
+
+/* RB <= 8 */
+tcg_gen_brcondi_i64(TCG_COND_LEU, rb, 8, l);
+
+get_avr64(hi, a->vra, true);
+
+/* RB >= 16: just copy VRA to VRB */
+tcg_gen_brcondi_i64(TCG_COND_GEU, rb, 16, end);
+
+/* 8 < RB < 16: copy lo and partially clear hi */
+tcg_gen_subfi_i64(rb, 16, rb);
+tcg_gen_shli_i64(rb, rb, 3);
+tcg_gen_shl_i64(hi, hi, rb);
+tcg_gen_shr_i64(hi, hi, rb);
+tcg_gen_br(end);
+
+/* 0 < RB <= 8: zeroes hi and partially clears lo */
+gen_set_label(l);
+tcg_gen_subfi_i64(rb, 8, rb);
+tcg_gen_shli_i64(rb, rb, 3);
+tcg_gen_shl_i64(lo, lo, rb);
+tcg_gen_shr_i64(lo, lo, rb);


There's a bit of redundancy here, and if we exploit that we can remove the 
branches.

Compute the mask modulo 8.  That result applies to either the first or second word, or 
neither.  Use 3 movcond to select among the cases:


   sh = (rb & 7) << 3;
   mask = ~(-1 << sh);
   ml = rb < 8 ? mask : -1;
   mh = rb < 8 ? 0 : mask;
   mh = rb < 16 ? mh : -1;
   lo &= ml;
   hi &= mh;
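
As TCG that's something like (untested; temp allocation and frees elided):

    tcg_gen_extu_tl_i64(rb, cpu_gpr[a->vrb]);
    tcg_gen_andi_i64(sh, rb, 7);
    tcg_gen_shli_i64(sh, sh, 3);
    tcg_gen_shl_i64(mask, tcg_constant_i64(-1), sh);
    tcg_gen_not_i64(mask, mask);

    get_avr64(lo, a->vra, false);
    get_avr64(hi, a->vra, true);

    tcg_gen_movcond_i64(TCG_COND_LTU, ml, rb, tcg_constant_i64(8),
                        mask, tcg_constant_i64(-1));
    tcg_gen_movcond_i64(TCG_COND_LTU, mh, rb, tcg_constant_i64(8),
                        tcg_constant_i64(0), mask);
    tcg_gen_movcond_i64(TCG_COND_LTU, mh, rb, tcg_constant_i64(16),
                        mh, tcg_constant_i64(-1));

    tcg_gen_and_i64(lo, lo, ml);
    tcg_gen_and_i64(hi, hi, mh);

    set_avr64(a->vrt, lo, false);
    set_avr64(a->vrt, hi, true);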


r~



Re: [PATCH v3 12/37] target/ppc: Implement Vector Compare Greater Than Quadword

2022-02-10 Thread Richard Henderson

On 2/10/22 23:34, matheus.fe...@eldorado.org.br wrote:

+get_avr64(t0, a->vra, true);
+get_avr64(t1, a->vrb, true);
+tcg_gen_brcond_i64(sign ? TCG_COND_GT : TCG_COND_GTU, t0, t1, l1);
+tcg_gen_brcond_i64(sign ? TCG_COND_LT : TCG_COND_LTU, t0, t1, l2);
+
+get_avr64(t0, a->vra, false);
+get_avr64(t1, a->vrb, false);
+tcg_gen_brcond_i64(TCG_COND_GTU, t0, t1, l1);
+tcg_gen_br(l2);


Similarly wrt branches, and computation of the result.
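
E.g. for the unsigned case, with setcond instead of brcond (untested;
the signed variant only changes the high-doubleword conditions):

    get_avr64(t0, a->vra, true);
    get_avr64(t1, a->vrb, true);
    get_avr64(t2, a->vra, false);
    get_avr64(t3, a->vrb, false);

    tcg_gen_setcond_i64(TCG_COND_GTU, gt, t0, t1);  /* high doublewords */
    tcg_gen_setcond_i64(TCG_COND_EQ, eq, t0, t1);
    tcg_gen_setcond_i64(TCG_COND_GTU, t2, t2, t3);  /* low doublewords */
    tcg_gen_and_i64(eq, eq, t2);
    tcg_gen_or_i64(gt, gt, eq);
    tcg_gen_neg_i64(gt, gt);                        /* 0 or -1 */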


r~



Re: [PATCH v3 11/37] target/ppc: Implement Vector Compare Equal Quadword

2022-02-10 Thread Richard Henderson

On 2/10/22 23:34, matheus.fe...@eldorado.org.br wrote:

From: Matheus Ferst 

Implement the following PowerISA v3.1 instructions:
vcmpequq Vector Compare Equal Quadword

Signed-off-by: Matheus Ferst 
---
  target/ppc/insn32.decode|  1 +
  target/ppc/translate/vmx-impl.c.inc | 43 +
  2 files changed, 44 insertions(+)

diff --git a/target/ppc/insn32.decode b/target/ppc/insn32.decode
index a0adf18671..39730df32d 100644
--- a/target/ppc/insn32.decode
+++ b/target/ppc/insn32.decode
@@ -382,6 +382,7 @@ VCMPEQUB    000100 ..... ..... ..... . 0000000110   @VC
  VCMPEQUH    000100 ..... ..... ..... . 0001000110   @VC
  VCMPEQUW    000100 ..... ..... ..... . 0010000110   @VC
  VCMPEQUD    000100 ..... ..... ..... . 0011000111   @VC
+VCMPEQUQ    000100 ..... ..... ..... . 0111000111   @VC
  
  VCMPGTSB    000100 ..... ..... ..... . 1100000110   @VC
  VCMPGTSH    000100 ..... ..... ..... . 1101000110   @VC
diff --git a/target/ppc/translate/vmx-impl.c.inc 
b/target/ppc/translate/vmx-impl.c.inc
index 67059ed9b2..bdb0b4370b 100644
--- a/target/ppc/translate/vmx-impl.c.inc
+++ b/target/ppc/translate/vmx-impl.c.inc
@@ -1112,6 +1112,49 @@ TRANS(VCMPNEZB, do_vcmpnez, MO_8)
  TRANS(VCMPNEZH, do_vcmpnez, MO_16)
  TRANS(VCMPNEZW, do_vcmpnez, MO_32)
  
+static bool trans_VCMPEQUQ(DisasContext *ctx, arg_VC *a)

+{
+TCGv_i64 t0, t1;
+TCGLabel *l1, *l2;
+
+REQUIRE_INSNS_FLAGS2(ctx, ISA310);
+REQUIRE_VECTOR(ctx);
+
+t0 = tcg_temp_new_i64();
+t1 = tcg_temp_new_i64();
+l1 = gen_new_label();
+l2 = gen_new_label();
+
+get_avr64(t0, a->vra, true);
+get_avr64(t1, a->vrb, true);
+tcg_gen_brcond_i64(TCG_COND_NE, t0, t1, l1);
+
+get_avr64(t0, a->vra, false);
+get_avr64(t1, a->vrb, false);
+tcg_gen_brcond_i64(TCG_COND_NE, t0, t1, l1);


It would be much better to not use a branch.
E.g.

get_avr64(t0, a->vra, true);
get_avr64(t1, a->vrb, true);
tcg_gen_xor_i64(c0, t0, t1);

get_avr64(t0, a->vra, false);
get_avr64(t1, a->vrb, false);
tcg_gen_xor_i64(c1, t0, t1);

tcg_gen_or_i64(c0, c0, c1);
tcg_gen_setcondi_i64(TCG_COND_EQ, c0, c0, 0);
tcg_gen_neg_i64(c0, c0);

set_avr64(a->vrt, c0, true);
set_avr64(a->vrt, c0, false);

tcg_gen_extrl_i64_i32(crf, c0);
tcg_gen_andi_i32(crf, crf, 0xa);
tcg_gen_xori_i32(crf, crf, 0x2);


r~



Re: [PATCH v3 10/37] target/ppc: Move Vector Compare Not Equal or Zero to decodetree

2022-02-10 Thread Richard Henderson

On 2/10/22 23:34, matheus.fe...@eldorado.org.br wrote:

+static void gen_vcmpnez_vec(unsigned vece, TCGv_vec t, TCGv_vec a, TCGv_vec b)
+{
+TCGv_vec t0, t1, zero;
+
+t0 = tcg_temp_new_vec_matching(t);
+t1 = tcg_temp_new_vec_matching(t);
+zero = tcg_constant_vec_matching(t, vece, 0);
+
+tcg_gen_cmp_vec(TCG_COND_EQ, vece, t0, a, zero);
+tcg_gen_cmp_vec(TCG_COND_EQ, vece, t1, b, zero);
+tcg_gen_cmp_vec(TCG_COND_NE, vece, t, a, b);
+
+tcg_gen_or_vec(vece, t, t, t0);
+tcg_gen_or_vec(vece, t, t, t1);
+
+tcg_gen_shli_vec(vece, t, t, (8 << vece) - 1);
+tcg_gen_sari_vec(vece, t, t, (8 << vece) - 1);


No shifting required, only the cmp.
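
i.e. just (untested):

    tcg_gen_cmp_vec(TCG_COND_EQ, vece, t0, a, zero);
    tcg_gen_cmp_vec(TCG_COND_EQ, vece, t1, b, zero);
    tcg_gen_cmp_vec(TCG_COND_NE, vece, t, a, b);
    tcg_gen_or_vec(vece, t, t, t0);
    tcg_gen_or_vec(vece, t, t, t1);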


+static bool do_vcmpnez(DisasContext *ctx, arg_VC *a, int vece)
+{
+static const TCGOpcode vecop_list[] = {
+INDEX_op_cmp_vec, INDEX_op_shli_vec, INDEX_op_sari_vec, 0
+};


Therefore no vecop_list required (cmp itself is mandatory).

r~



[PATCH v6 0/6] support subsets of the Floating-Point in Integer Registers extensions

2022-02-10 Thread Weiwei Li
This patchset implements the RISC-V Floating-Point in Integer Registers
extensions (version 1.0), which include the Zfinx, Zdinx, Zhinx and
Zhinxmin extensions.

Specification:
https://github.com/riscv/riscv-zfinx/blob/main/zfinx-1.0.0.pdf

The port is available here:
https://github.com/plctlab/plct-qemu/tree/plct-zfinx-upstream-v6

To test this implementation, specify the cpu argument with
'zfinx=true,zdinx=true,zhinx=true,zhinxmin=true' and
'g=false,f=false,d=false,Zfh=false,Zfhmin=false'.
This implementation passes the gcc tests; CI results can be found at
https://ci.rvperf.org/job/plct-qemu-zfinx-upstream/.

v6:
* rename flags Z*inx to z*inx
* rebase on apply-to-riscv.next

v5:
* put definition of ftemp and nftemp together, add comments for them
* separate the declaration of variable i from the loop

v4:
* combine register pair check for rv32 zdinx
* clear mstatus.FS when RVF is disabled by write_misa

v3:
* delete unused reset for mstatus.FS
* use positive test for RVF instead of negative test for ZFINX
* replace get_ol with get_xl
* use tcg_gen_concat_tl_i64 to unify tcg_gen_concat_i32_i64 and 
tcg_gen_deposit_i64

v2:
* hardwire mstatus.FS to zero when enable zfinx
* do register-pair check at the begin of translation
* optimize partial implemention as suggested

Weiwei Li (6):
  target/riscv: add cfg properties for zfinx, zdinx and zhinx{min}
  target/riscv: hardwire mstatus.FS to zero when enable zfinx
  target/riscv: add support for zfinx
  target/riscv: add support for zdinx
  target/riscv: add support for zhinx/zhinxmin
  target/riscv: expose zfinx, zdinx, zhinx{min} properties

 target/riscv/cpu.c|  17 ++
 target/riscv/cpu.h|   4 +
 target/riscv/cpu_helper.c |   6 +-
 target/riscv/csr.c|  25 +-
 target/riscv/fpu_helper.c | 178 ++--
 target/riscv/helper.h |   4 +-
 target/riscv/insn_trans/trans_rvd.c.inc   | 285 ++-
 target/riscv/insn_trans/trans_rvf.c.inc   | 314 +---
 target/riscv/insn_trans/trans_rvzfh.c.inc | 332 +++---
 target/riscv/internals.h  |  32 ++-
 target/riscv/translate.c  | 149 +-
 11 files changed, 974 insertions(+), 372 deletions(-)

-- 
2.17.1




[PATCH v6 4/6] target/riscv: add support for zdinx

2022-02-10 Thread Weiwei Li
  -- update extension check REQUIRE_ZDINX_OR_D
  -- update double floating-point register read/write

Co-authored-by: ardxwe 
Signed-off-by: Weiwei Li 
Signed-off-by: Junqiang Wang 
Reviewed-by: Richard Henderson 
---
 target/riscv/insn_trans/trans_rvd.c.inc | 285 +---
 target/riscv/translate.c|  52 +
 2 files changed, 259 insertions(+), 78 deletions(-)

diff --git a/target/riscv/insn_trans/trans_rvd.c.inc 
b/target/riscv/insn_trans/trans_rvd.c.inc
index 091ed3a8ad..1397c1ce1c 100644
--- a/target/riscv/insn_trans/trans_rvd.c.inc
+++ b/target/riscv/insn_trans/trans_rvd.c.inc
@@ -18,6 +18,19 @@
 * this program.  If not, see <http://www.gnu.org/licenses/>.
  */
 
+#define REQUIRE_ZDINX_OR_D(ctx) do { \
+if (!ctx->cfg_ptr->ext_zdinx) { \
+REQUIRE_EXT(ctx, RVD); \
+} \
+} while (0)
+
+#define REQUIRE_EVEN(ctx, reg) do { \
+if (ctx->cfg_ptr->ext_zdinx && (get_xl(ctx) == MXL_RV32) && \
+((reg) & 0x1)) { \
+return false; \
+} \
+} while (0)
+
 static bool trans_fld(DisasContext *ctx, arg_fld *a)
 {
 TCGv addr;
@@ -47,10 +60,17 @@ static bool trans_fsd(DisasContext *ctx, arg_fsd *a)
 static bool trans_fmadd_d(DisasContext *ctx, arg_fmadd_d *a)
 {
 REQUIRE_FPU;
-REQUIRE_EXT(ctx, RVD);
+REQUIRE_ZDINX_OR_D(ctx);
+REQUIRE_EVEN(ctx, a->rd | a->rs1 | a->rs2 | a->rs3);
+
+TCGv_i64 dest = dest_fpr(ctx, a->rd);
+TCGv_i64 src1 = get_fpr_d(ctx, a->rs1);
+TCGv_i64 src2 = get_fpr_d(ctx, a->rs2);
+TCGv_i64 src3 = get_fpr_d(ctx, a->rs3);
+
 gen_set_rm(ctx, a->rm);
-gen_helper_fmadd_d(cpu_fpr[a->rd], cpu_env, cpu_fpr[a->rs1],
-   cpu_fpr[a->rs2], cpu_fpr[a->rs3]);
+gen_helper_fmadd_d(dest, cpu_env, src1, src2, src3);
+gen_set_fpr_d(ctx, a->rd, dest);
 mark_fs_dirty(ctx);
 return true;
 }
@@ -58,10 +78,17 @@ static bool trans_fmadd_d(DisasContext *ctx, arg_fmadd_d *a)
 static bool trans_fmsub_d(DisasContext *ctx, arg_fmsub_d *a)
 {
 REQUIRE_FPU;
-REQUIRE_EXT(ctx, RVD);
+REQUIRE_ZDINX_OR_D(ctx);
+REQUIRE_EVEN(ctx, a->rd | a->rs1 | a->rs2 | a->rs3);
+
+TCGv_i64 dest = dest_fpr(ctx, a->rd);
+TCGv_i64 src1 = get_fpr_d(ctx, a->rs1);
+TCGv_i64 src2 = get_fpr_d(ctx, a->rs2);
+TCGv_i64 src3 = get_fpr_d(ctx, a->rs3);
+
 gen_set_rm(ctx, a->rm);
-gen_helper_fmsub_d(cpu_fpr[a->rd], cpu_env, cpu_fpr[a->rs1],
-   cpu_fpr[a->rs2], cpu_fpr[a->rs3]);
+gen_helper_fmsub_d(dest, cpu_env, src1, src2, src3);
+gen_set_fpr_d(ctx, a->rd, dest);
 mark_fs_dirty(ctx);
 return true;
 }
@@ -69,10 +96,17 @@ static bool trans_fmsub_d(DisasContext *ctx, arg_fmsub_d *a)
 static bool trans_fnmsub_d(DisasContext *ctx, arg_fnmsub_d *a)
 {
 REQUIRE_FPU;
-REQUIRE_EXT(ctx, RVD);
+REQUIRE_ZDINX_OR_D(ctx);
+REQUIRE_EVEN(ctx, a->rd | a->rs1 | a->rs2 | a->rs3);
+
+TCGv_i64 dest = dest_fpr(ctx, a->rd);
+TCGv_i64 src1 = get_fpr_d(ctx, a->rs1);
+TCGv_i64 src2 = get_fpr_d(ctx, a->rs2);
+TCGv_i64 src3 = get_fpr_d(ctx, a->rs3);
+
 gen_set_rm(ctx, a->rm);
-gen_helper_fnmsub_d(cpu_fpr[a->rd], cpu_env, cpu_fpr[a->rs1],
-cpu_fpr[a->rs2], cpu_fpr[a->rs3]);
+gen_helper_fnmsub_d(dest, cpu_env, src1, src2, src3);
+gen_set_fpr_d(ctx, a->rd, dest);
 mark_fs_dirty(ctx);
 return true;
 }
@@ -80,10 +114,17 @@ static bool trans_fnmsub_d(DisasContext *ctx, arg_fnmsub_d 
*a)
 static bool trans_fnmadd_d(DisasContext *ctx, arg_fnmadd_d *a)
 {
 REQUIRE_FPU;
-REQUIRE_EXT(ctx, RVD);
+REQUIRE_ZDINX_OR_D(ctx);
+REQUIRE_EVEN(ctx, a->rd | a->rs1 | a->rs2 | a->rs3);
+
+TCGv_i64 dest = dest_fpr(ctx, a->rd);
+TCGv_i64 src1 = get_fpr_d(ctx, a->rs1);
+TCGv_i64 src2 = get_fpr_d(ctx, a->rs2);
+TCGv_i64 src3 = get_fpr_d(ctx, a->rs3);
+
 gen_set_rm(ctx, a->rm);
-gen_helper_fnmadd_d(cpu_fpr[a->rd], cpu_env, cpu_fpr[a->rs1],
-cpu_fpr[a->rs2], cpu_fpr[a->rs3]);
+gen_helper_fnmadd_d(dest, cpu_env, src1, src2, src3);
+gen_set_fpr_d(ctx, a->rd, dest);
 mark_fs_dirty(ctx);
 return true;
 }
@@ -91,12 +132,16 @@ static bool trans_fnmadd_d(DisasContext *ctx, arg_fnmadd_d 
*a)
 static bool trans_fadd_d(DisasContext *ctx, arg_fadd_d *a)
 {
 REQUIRE_FPU;
-REQUIRE_EXT(ctx, RVD);
+REQUIRE_ZDINX_OR_D(ctx);
+REQUIRE_EVEN(ctx, a->rd | a->rs1 | a->rs2);
 
-gen_set_rm(ctx, a->rm);
-gen_helper_fadd_d(cpu_fpr[a->rd], cpu_env,
-  cpu_fpr[a->rs1], cpu_fpr[a->rs2]);
+TCGv_i64 dest = dest_fpr(ctx, a->rd);
+TCGv_i64 src1 = get_fpr_d(ctx, a->rs1);
+TCGv_i64 src2 = get_fpr_d(ctx, a->rs2);
 
+gen_set_rm(ctx, a->rm);
+gen_helper_fadd_d(dest, cpu_env, src1, src2);
+gen_set_fpr_d(ctx, a->rd, dest);
 mark_fs_dirty(ctx);
 return true;
 }
@@ -104,12 +149,16 @@ static bool trans_fadd_d(DisasContext *ctx, arg_fadd_d *a)
 static bool trans_fsub_d(DisasContext *ctx, arg_fsub_d *a)
 {
 REQ

[PATCH v6 5/6] target/riscv: add support for zhinx/zhinxmin

2022-02-10 Thread Weiwei Li
  - update extension check REQUIRE_ZHINX_OR_ZFH and 
REQUIRE_ZFH_OR_ZFHMIN_OR_ZHINX_OR_ZHINXMIN
  - update half floating-point register read/write
  - disable nanbox_h check

Signed-off-by: Weiwei Li 
Signed-off-by: Junqiang Wang 
Reviewed-by: Richard Henderson 
---
 target/riscv/fpu_helper.c |  89 +++---
 target/riscv/helper.h |   2 +-
 target/riscv/insn_trans/trans_rvzfh.c.inc | 332 +++---
 target/riscv/internals.h  |  16 +-
 4 files changed, 296 insertions(+), 143 deletions(-)

diff --git a/target/riscv/fpu_helper.c b/target/riscv/fpu_helper.c
index 63ca703459..5699c9517f 100644
--- a/target/riscv/fpu_helper.c
+++ b/target/riscv/fpu_helper.c
@@ -89,10 +89,11 @@ void helper_set_rod_rounding_mode(CPURISCVState *env)
 static uint64_t do_fmadd_h(CPURISCVState *env, uint64_t rs1, uint64_t rs2,
uint64_t rs3, int flags)
 {
-float16 frs1 = check_nanbox_h(rs1);
-float16 frs2 = check_nanbox_h(rs2);
-float16 frs3 = check_nanbox_h(rs3);
-return nanbox_h(float16_muladd(frs1, frs2, frs3, flags, &env->fp_status));
+float16 frs1 = check_nanbox_h(env, rs1);
+float16 frs2 = check_nanbox_h(env, rs2);
+float16 frs3 = check_nanbox_h(env, rs3);
+return nanbox_h(env, float16_muladd(frs1, frs2, frs3, flags,
+&env->fp_status));
 }
 
 static uint64_t do_fmadd_s(CPURISCVState *env, uint64_t rs1, uint64_t rs2,
@@ -417,146 +418,146 @@ target_ulong helper_fclass_d(uint64_t frs1)
 
 uint64_t helper_fadd_h(CPURISCVState *env, uint64_t rs1, uint64_t rs2)
 {
-float16 frs1 = check_nanbox_h(rs1);
-float16 frs2 = check_nanbox_h(rs2);
-return nanbox_h(float16_add(frs1, frs2, &env->fp_status));
+float16 frs1 = check_nanbox_h(env, rs1);
+float16 frs2 = check_nanbox_h(env, rs2);
+return nanbox_h(env, float16_add(frs1, frs2, &env->fp_status));
 }
 
 uint64_t helper_fsub_h(CPURISCVState *env, uint64_t rs1, uint64_t rs2)
 {
-float16 frs1 = check_nanbox_h(rs1);
-float16 frs2 = check_nanbox_h(rs2);
-return nanbox_h(float16_sub(frs1, frs2, &env->fp_status));
+float16 frs1 = check_nanbox_h(env, rs1);
+float16 frs2 = check_nanbox_h(env, rs2);
+return nanbox_h(env, float16_sub(frs1, frs2, &env->fp_status));
 }
 
 uint64_t helper_fmul_h(CPURISCVState *env, uint64_t rs1, uint64_t rs2)
 {
-float16 frs1 = check_nanbox_h(rs1);
-float16 frs2 = check_nanbox_h(rs2);
-return nanbox_h(float16_mul(frs1, frs2, &env->fp_status));
+float16 frs1 = check_nanbox_h(env, rs1);
+float16 frs2 = check_nanbox_h(env, rs2);
+return nanbox_h(env, float16_mul(frs1, frs2, &env->fp_status));
 }
 
 uint64_t helper_fdiv_h(CPURISCVState *env, uint64_t rs1, uint64_t rs2)
 {
-float16 frs1 = check_nanbox_h(rs1);
-float16 frs2 = check_nanbox_h(rs2);
-return nanbox_h(float16_div(frs1, frs2, &env->fp_status));
+float16 frs1 = check_nanbox_h(env, rs1);
+float16 frs2 = check_nanbox_h(env, rs2);
+return nanbox_h(env, float16_div(frs1, frs2, &env->fp_status));
 }
 
 uint64_t helper_fmin_h(CPURISCVState *env, uint64_t rs1, uint64_t rs2)
 {
-float16 frs1 = check_nanbox_h(rs1);
-float16 frs2 = check_nanbox_h(rs2);
-return nanbox_h(env->priv_ver < PRIV_VERSION_1_11_0 ?
+float16 frs1 = check_nanbox_h(env, rs1);
+float16 frs2 = check_nanbox_h(env, rs2);
+return nanbox_h(env, env->priv_ver < PRIV_VERSION_1_11_0 ?
 float16_minnum(frs1, frs2, &env->fp_status) :
 float16_minimum_number(frs1, frs2, &env->fp_status));
 }
 
 uint64_t helper_fmax_h(CPURISCVState *env, uint64_t rs1, uint64_t rs2)
 {
-float16 frs1 = check_nanbox_h(rs1);
-float16 frs2 = check_nanbox_h(rs2);
-return nanbox_h(env->priv_ver < PRIV_VERSION_1_11_0 ?
+float16 frs1 = check_nanbox_h(env, rs1);
+float16 frs2 = check_nanbox_h(env, rs2);
+return nanbox_h(env, env->priv_ver < PRIV_VERSION_1_11_0 ?
 float16_maxnum(frs1, frs2, &env->fp_status) :
 float16_maximum_number(frs1, frs2, &env->fp_status));
 }
 
 uint64_t helper_fsqrt_h(CPURISCVState *env, uint64_t rs1)
 {
-float16 frs1 = check_nanbox_h(rs1);
-return nanbox_h(float16_sqrt(frs1, &env->fp_status));
+float16 frs1 = check_nanbox_h(env, rs1);
+return nanbox_h(env, float16_sqrt(frs1, &env->fp_status));
 }
 
 target_ulong helper_fle_h(CPURISCVState *env, uint64_t rs1, uint64_t rs2)
 {
-float16 frs1 = check_nanbox_h(rs1);
-float16 frs2 = check_nanbox_h(rs2);
+float16 frs1 = check_nanbox_h(env, rs1);
+float16 frs2 = check_nanbox_h(env, rs2);
 return float16_le(frs1, frs2, &env->fp_status);
 }
 
 target_ulong helper_flt_h(CPURISCVState *env, uint64_t rs1, uint64_t rs2)
 {
-float16 frs1 = check_nanbox_h(rs1);
-float16 frs2 = check_nanbox_h(rs2);
+float16 frs1 = check_nanbox_h(env, rs1);
+float16 frs2 = check_nanbox_h(env, rs2);
 return float16_lt(frs1, frs

[PATCH v6 6/6] target/riscv: expose zfinx, zdinx, zhinx{min} properties

2022-02-10 Thread Weiwei Li
Co-authored-by: ardxwe 
Signed-off-by: Weiwei Li 
Signed-off-by: Junqiang Wang 
Reviewed-by: Richard Henderson 
Reviewed-by: Alistair Francis 
---
 target/riscv/cpu.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/target/riscv/cpu.c b/target/riscv/cpu.c
index 55371b1aa5..ddda4906ff 100644
--- a/target/riscv/cpu.c
+++ b/target/riscv/cpu.c
@@ -795,6 +795,11 @@ static Property riscv_cpu_properties[] = {
 DEFINE_PROP_BOOL("zbc", RISCVCPU, cfg.ext_zbc, true),
 DEFINE_PROP_BOOL("zbs", RISCVCPU, cfg.ext_zbs, true),
 
+DEFINE_PROP_BOOL("zdinx", RISCVCPU, cfg.ext_zdinx, false),
+DEFINE_PROP_BOOL("zfinx", RISCVCPU, cfg.ext_zfinx, false),
+DEFINE_PROP_BOOL("zhinx", RISCVCPU, cfg.ext_zhinx, false),
+DEFINE_PROP_BOOL("zhinxmin", RISCVCPU, cfg.ext_zhinxmin, false),
+
 /* Vendor-specific custom extensions */
 DEFINE_PROP_BOOL("xventanacondops", RISCVCPU, cfg.ext_XVentanaCondOps, 
false),
 
-- 
2.17.1




[PATCH v6 2/6] target/riscv: hardwire mstatus.FS to zero when enable zfinx

2022-02-10 Thread Weiwei Li
Co-authored-by: ardxwe 
Signed-off-by: Weiwei Li 
Signed-off-by: Junqiang Wang 
Reviewed-by: Alistair Francis 
---
 target/riscv/cpu_helper.c |  6 +-
 target/riscv/csr.c| 25 -
 target/riscv/translate.c  |  4 
 3 files changed, 29 insertions(+), 6 deletions(-)

diff --git a/target/riscv/cpu_helper.c b/target/riscv/cpu_helper.c
index 746335bfd6..1c60fb2e80 100644
--- a/target/riscv/cpu_helper.c
+++ b/target/riscv/cpu_helper.c
@@ -466,9 +466,13 @@ bool riscv_cpu_vector_enabled(CPURISCVState *env)
 
 void riscv_cpu_swap_hypervisor_regs(CPURISCVState *env)
 {
-uint64_t mstatus_mask = MSTATUS_MXR | MSTATUS_SUM | MSTATUS_FS |
+uint64_t mstatus_mask = MSTATUS_MXR | MSTATUS_SUM |
 MSTATUS_SPP | MSTATUS_SPIE | MSTATUS_SIE |
 MSTATUS64_UXL | MSTATUS_VS;
+
+if (riscv_has_ext(env, RVF)) {
+mstatus_mask |= MSTATUS_FS;
+}
 bool current_virt = riscv_cpu_virt_enabled(env);
 
 g_assert(riscv_has_ext(env, RVH));
diff --git a/target/riscv/csr.c b/target/riscv/csr.c
index 387088a86c..93bba1ca1c 100644
--- a/target/riscv/csr.c
+++ b/target/riscv/csr.c
@@ -38,7 +38,8 @@ void riscv_set_csr_ops(int csrno, riscv_csr_operations *ops)
 static RISCVException fs(CPURISCVState *env, int csrno)
 {
 #if !defined(CONFIG_USER_ONLY)
-if (!env->debugger && !riscv_cpu_fp_enabled(env)) {
+if (!env->debugger && !riscv_cpu_fp_enabled(env) &&
+!RISCV_CPU(env_cpu(env))->cfg.ext_zfinx) {
 return RISCV_EXCP_ILLEGAL_INST;
 }
 #endif
@@ -301,7 +302,9 @@ static RISCVException write_fflags(CPURISCVState *env, int 
csrno,
target_ulong val)
 {
 #if !defined(CONFIG_USER_ONLY)
-env->mstatus |= MSTATUS_FS;
+if (riscv_has_ext(env, RVF)) {
+env->mstatus |= MSTATUS_FS;
+}
 #endif
 riscv_cpu_set_fflags(env, val & (FSR_AEXC >> FSR_AEXC_SHIFT));
 return RISCV_EXCP_NONE;
@@ -318,7 +321,9 @@ static RISCVException write_frm(CPURISCVState *env, int 
csrno,
 target_ulong val)
 {
 #if !defined(CONFIG_USER_ONLY)
-env->mstatus |= MSTATUS_FS;
+if (riscv_has_ext(env, RVF)) {
+env->mstatus |= MSTATUS_FS;
+}
 #endif
 env->frm = val & (FSR_RD >> FSR_RD_SHIFT);
 return RISCV_EXCP_NONE;
@@ -336,7 +341,9 @@ static RISCVException write_fcsr(CPURISCVState *env, int 
csrno,
  target_ulong val)
 {
 #if !defined(CONFIG_USER_ONLY)
-env->mstatus |= MSTATUS_FS;
+if (riscv_has_ext(env, RVF)) {
+env->mstatus |= MSTATUS_FS;
+}
 #endif
 env->frm = (val & FSR_RD) >> FSR_RD_SHIFT;
 riscv_cpu_set_fflags(env, (val & FSR_AEXC) >> FSR_AEXC_SHIFT);
@@ -652,10 +659,14 @@ static RISCVException write_mstatus(CPURISCVState *env, 
int csrno,
 tlb_flush(env_cpu(env));
 }
 mask = MSTATUS_SIE | MSTATUS_SPIE | MSTATUS_MIE | MSTATUS_MPIE |
-MSTATUS_SPP | MSTATUS_FS | MSTATUS_MPRV | MSTATUS_SUM |
+MSTATUS_SPP | MSTATUS_MPRV | MSTATUS_SUM |
 MSTATUS_MPP | MSTATUS_MXR | MSTATUS_TVM | MSTATUS_TSR |
 MSTATUS_TW | MSTATUS_VS;
 
+if (riscv_has_ext(env, RVF)) {
+mask |= MSTATUS_FS;
+}
+
 if (xl != MXL_RV32 || env->debugger) {
 /*
  * RV32: MPV and GVA are not in mstatus. The current plan is to
@@ -787,6 +798,10 @@ static RISCVException write_misa(CPURISCVState *env, int 
csrno,
 return RISCV_EXCP_NONE;
 }
 
+if (!(val & RVF)) {
+env->mstatus &= ~MSTATUS_FS;
+}
+
 /* flush translation cache */
 tb_flush(env_cpu(env));
 env->misa_ext = val;
diff --git a/target/riscv/translate.c b/target/riscv/translate.c
index 84dbfa6340..c7232de326 100644
--- a/target/riscv/translate.c
+++ b/target/riscv/translate.c
@@ -426,6 +426,10 @@ static void mark_fs_dirty(DisasContext *ctx)
 {
 TCGv tmp;
 
+if (!has_ext(ctx, RVF)) {
+return;
+}
+
 if (ctx->mstatus_fs != MSTATUS_FS) {
 /* Remember the state change for the rest of the TB. */
 ctx->mstatus_fs = MSTATUS_FS;
-- 
2.17.1




[PATCH v6 1/6] target/riscv: add cfg properties for zfinx, zdinx and zhinx{min}

2022-02-10 Thread Weiwei Li
Co-authored-by: ardxwe 
Signed-off-by: Weiwei Li 
Signed-off-by: Junqiang Wang 
Reviewed-by: Richard Henderson 
Reviewed-by: Alistair Francis 
---
 target/riscv/cpu.c | 12 ++++++++++++
 target/riscv/cpu.h |  4 ++++
 2 files changed, 16 insertions(+)

diff --git a/target/riscv/cpu.c b/target/riscv/cpu.c
index b0a40b83e7..55371b1aa5 100644
--- a/target/riscv/cpu.c
+++ b/target/riscv/cpu.c
@@ -587,6 +587,11 @@ static void riscv_cpu_realize(DeviceState *dev, Error 
**errp)
 cpu->cfg.ext_d = true;
 }
 
+if (cpu->cfg.ext_zdinx || cpu->cfg.ext_zhinx ||
+cpu->cfg.ext_zhinxmin) {
+cpu->cfg.ext_zfinx = true;
+}
+
 /* Set the ISA extensions, checks should have happened above */
 if (cpu->cfg.ext_i) {
 ext |= RVI;
@@ -665,6 +670,13 @@ static void riscv_cpu_realize(DeviceState *dev, Error 
**errp)
 if (cpu->cfg.ext_j) {
 ext |= RVJ;
 }
+if (cpu->cfg.ext_zfinx && ((ext & (RVF | RVD)) || cpu->cfg.ext_zfh ||
+   cpu->cfg.ext_zfhmin)) {
+error_setg(errp,
+"'Zfinx' cannot be supported together with 'F', 'D', 
'Zfh',"
+" 'Zfhmin'");
+return;
+}
 
 set_misa(env, env->misa_mxl, ext);
 }
diff --git a/target/riscv/cpu.h b/target/riscv/cpu.h
index 8183fb86d5..9ba05042ed 100644
--- a/target/riscv/cpu.h
+++ b/target/riscv/cpu.h
@@ -362,8 +362,12 @@ struct RISCVCPUConfig {
 bool ext_svinval;
 bool ext_svnapot;
 bool ext_svpbmt;
+bool ext_zdinx;
 bool ext_zfh;
 bool ext_zfhmin;
+bool ext_zfinx;
+bool ext_zhinx;
+bool ext_zhinxmin;
 bool ext_zve32f;
 bool ext_zve64f;
 
-- 
2.17.1




[PATCH v6 3/6] target/riscv: add support for zfinx

2022-02-10 Thread Weiwei Li
  - update extension check REQUIRE_ZFINX_OR_F
  - update single floating-point register read/write
  - disable nanbox_s check

Co-authored-by: ardxwe 
Signed-off-by: Weiwei Li 
Signed-off-by: Junqiang Wang 
Reviewed-by: Richard Henderson 
---
 target/riscv/fpu_helper.c   |  89 +++
 target/riscv/helper.h   |   2 +-
 target/riscv/insn_trans/trans_rvf.c.inc | 314 
 target/riscv/internals.h|  16 +-
 target/riscv/translate.c|  93 ++-
 5 files changed, 369 insertions(+), 145 deletions(-)

diff --git a/target/riscv/fpu_helper.c b/target/riscv/fpu_helper.c
index 4a5982d594..63ca703459 100644
--- a/target/riscv/fpu_helper.c
+++ b/target/riscv/fpu_helper.c
@@ -98,10 +98,11 @@ static uint64_t do_fmadd_h(CPURISCVState *env, uint64_t 
rs1, uint64_t rs2,
 static uint64_t do_fmadd_s(CPURISCVState *env, uint64_t rs1, uint64_t rs2,
uint64_t rs3, int flags)
 {
-float32 frs1 = check_nanbox_s(rs1);
-float32 frs2 = check_nanbox_s(rs2);
-float32 frs3 = check_nanbox_s(rs3);
-return nanbox_s(float32_muladd(frs1, frs2, frs3, flags, &env->fp_status));
+float32 frs1 = check_nanbox_s(env, rs1);
+float32 frs2 = check_nanbox_s(env, rs2);
+float32 frs3 = check_nanbox_s(env, rs3);
+return nanbox_s(env, float32_muladd(frs1, frs2, frs3, flags,
+&env->fp_status));
 }
 
 uint64_t helper_fmadd_s(CPURISCVState *env, uint64_t frs1, uint64_t frs2,
@@ -183,124 +184,124 @@ uint64_t helper_fnmadd_h(CPURISCVState *env, uint64_t 
frs1, uint64_t frs2,
 
 uint64_t helper_fadd_s(CPURISCVState *env, uint64_t rs1, uint64_t rs2)
 {
-float32 frs1 = check_nanbox_s(rs1);
-float32 frs2 = check_nanbox_s(rs2);
-return nanbox_s(float32_add(frs1, frs2, &env->fp_status));
+float32 frs1 = check_nanbox_s(env, rs1);
+float32 frs2 = check_nanbox_s(env, rs2);
+return nanbox_s(env, float32_add(frs1, frs2, &env->fp_status));
 }
 
 uint64_t helper_fsub_s(CPURISCVState *env, uint64_t rs1, uint64_t rs2)
 {
-float32 frs1 = check_nanbox_s(rs1);
-float32 frs2 = check_nanbox_s(rs2);
-return nanbox_s(float32_sub(frs1, frs2, &env->fp_status));
+float32 frs1 = check_nanbox_s(env, rs1);
+float32 frs2 = check_nanbox_s(env, rs2);
+return nanbox_s(env, float32_sub(frs1, frs2, &env->fp_status));
 }
 
 uint64_t helper_fmul_s(CPURISCVState *env, uint64_t rs1, uint64_t rs2)
 {
-float32 frs1 = check_nanbox_s(rs1);
-float32 frs2 = check_nanbox_s(rs2);
-return nanbox_s(float32_mul(frs1, frs2, &env->fp_status));
+float32 frs1 = check_nanbox_s(env, rs1);
+float32 frs2 = check_nanbox_s(env, rs2);
+return nanbox_s(env, float32_mul(frs1, frs2, &env->fp_status));
 }
 
 uint64_t helper_fdiv_s(CPURISCVState *env, uint64_t rs1, uint64_t rs2)
 {
-float32 frs1 = check_nanbox_s(rs1);
-float32 frs2 = check_nanbox_s(rs2);
-return nanbox_s(float32_div(frs1, frs2, &env->fp_status));
+float32 frs1 = check_nanbox_s(env, rs1);
+float32 frs2 = check_nanbox_s(env, rs2);
+return nanbox_s(env, float32_div(frs1, frs2, &env->fp_status));
 }
 
 uint64_t helper_fmin_s(CPURISCVState *env, uint64_t rs1, uint64_t rs2)
 {
-float32 frs1 = check_nanbox_s(rs1);
-float32 frs2 = check_nanbox_s(rs2);
-return nanbox_s(env->priv_ver < PRIV_VERSION_1_11_0 ?
+float32 frs1 = check_nanbox_s(env, rs1);
+float32 frs2 = check_nanbox_s(env, rs2);
+return nanbox_s(env, env->priv_ver < PRIV_VERSION_1_11_0 ?
 float32_minnum(frs1, frs2, &env->fp_status) :
 float32_minimum_number(frs1, frs2, &env->fp_status));
 }
 
 uint64_t helper_fmax_s(CPURISCVState *env, uint64_t rs1, uint64_t rs2)
 {
-float32 frs1 = check_nanbox_s(rs1);
-float32 frs2 = check_nanbox_s(rs2);
-return nanbox_s(env->priv_ver < PRIV_VERSION_1_11_0 ?
+float32 frs1 = check_nanbox_s(env, rs1);
+float32 frs2 = check_nanbox_s(env, rs2);
+return nanbox_s(env, env->priv_ver < PRIV_VERSION_1_11_0 ?
 float32_maxnum(frs1, frs2, &env->fp_status) :
 float32_maximum_number(frs1, frs2, &env->fp_status));
 }
 
 uint64_t helper_fsqrt_s(CPURISCVState *env, uint64_t rs1)
 {
-float32 frs1 = check_nanbox_s(rs1);
-return nanbox_s(float32_sqrt(frs1, &env->fp_status));
+float32 frs1 = check_nanbox_s(env, rs1);
+return nanbox_s(env, float32_sqrt(frs1, &env->fp_status));
 }
 
 target_ulong helper_fle_s(CPURISCVState *env, uint64_t rs1, uint64_t rs2)
 {
-float32 frs1 = check_nanbox_s(rs1);
-float32 frs2 = check_nanbox_s(rs2);
+float32 frs1 = check_nanbox_s(env, rs1);
+float32 frs2 = check_nanbox_s(env, rs2);
 return float32_le(frs1, frs2, &env->fp_status);
 }
 
 target_ulong helper_flt_s(CPURISCVState *env, uint64_t rs1, uint64_t rs2)
 {
-float32 frs1 = check_nanbox_s(rs1);
-float32 frs2 = check_nanbox_s(rs2);
+float32 frs1 = check_nanbox_s(env, rs1);

Re: [PATCH v3 09/37] target/ppc: Move Vector Compare Equal/Not Equal/Greater Than to decodetree

2022-02-10 Thread Richard Henderson

On 2/10/22 23:34, matheus.fe...@eldorado.org.br wrote:

+static void do_vcmp_rc(int vrt)
+{
+TCGv_i64 t0, t1;
+
+t0 = tcg_temp_new_i64();
+t1 = tcg_temp_new_i64();
+
+get_avr64(t0, vrt, true);
+tcg_gen_ctpop_i64(t1, t0);
+get_avr64(t0, vrt, false);
+tcg_gen_ctpop_i64(t0, t0);
+tcg_gen_add_i64(t1, t0, t1);


I don't understand the ctpop here.  I would have expected:

tcg_gen_and_i64(set, t0, t1);
tcg_gen_or_i64(clr, t0, t1);
tcg_gen_setcondi_i64(TCG_COND_EQ, set, set, -1); /* all bits set */
tcg_gen_setcondi_i64(TCG_COND_EQ, clr, clr, 0);  /* all bits clear */
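
and then CR6 assembled from those two bits, e.g. (untested):

    tcg_gen_shli_i64(set, set, 3);      /* 0b1000: all elements true  */
    tcg_gen_shli_i64(clr, clr, 1);      /* 0b0010: all elements false */
    tcg_gen_or_i64(set, set, clr);
    tcg_gen_extrl_i64_i32(cpu_crf[6], set);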



+static bool do_vcmp(DisasContext *ctx, arg_VC *a, TCGCond cond, int vece)
+{
+REQUIRE_VECTOR(ctx);
+
+tcg_gen_gvec_cmp(cond, vece, avr_full_offset(a->vrt),
+ avr_full_offset(a->vra), avr_full_offset(a->vrb), 16, 16);
+tcg_gen_gvec_shli(vece, avr_full_offset(a->vrt), avr_full_offset(a->vrt),
+  (8 << vece) - 1, 16, 16);
+tcg_gen_gvec_sari(vece, avr_full_offset(a->vrt), avr_full_offset(a->vrt),
+  (8 << vece) - 1, 16, 16);


Vector compare already produces -1; no need for anything beyond the cmp.


r~



Re: [PATCH v3 08/37] target/ppc: Implement vextsd2q

2022-02-10 Thread Richard Henderson

On 2/10/22 23:34, matheus.fe...@eldorado.org.br wrote:

From: Lucas Coutinho

Signed-off-by: Lucas Coutinho
Signed-off-by: Matheus Ferst
---
  target/ppc/insn32.decode|  1 +
  target/ppc/translate/vmx-impl.c.inc | 18 ++
  2 files changed, 19 insertions(+)


Reviewed-by: Richard Henderson 

r~



Re: [PATCH v3 07/37] target/ppc: Move vexts[bhw]2[wd] to decodetree

2022-02-10 Thread Richard Henderson

On 2/10/22 23:34, matheus.fe...@eldorado.org.br wrote:

From: Lucas Coutinho 

Move the following instructions to decodetree:
vextsb2w: Vector Extend Sign Byte To Word
vextsh2w: Vector Extend Sign Halfword To Word
vextsb2d: Vector Extend Sign Byte To Doubleword
vextsh2d: Vector Extend Sign Halfword To Doubleword
vextsw2d: Vector Extend Sign Word To Doubleword

Signed-off-by: Lucas Coutinho 
Signed-off-by: Matheus Ferst 
---
  target/ppc/helper.h |  5 -----
  target/ppc/insn32.decode|  8 ++++++++
  target/ppc/int_helper.c | 15 ---------------
  target/ppc/translate/vmx-impl.c.inc | 25 ++++++++++++++++++++-----
  target/ppc/translate/vmx-ops.c.inc  |  5 -----
  5 files changed, 28 insertions(+), 30 deletions(-)

diff --git a/target/ppc/helper.h b/target/ppc/helper.h
index 92595a42df..0084080fad 100644
--- a/target/ppc/helper.h
+++ b/target/ppc/helper.h
@@ -249,11 +249,6 @@ DEF_HELPER_4(VINSBLX, void, env, avr, i64, tl)
  DEF_HELPER_4(VINSHLX, void, env, avr, i64, tl)
  DEF_HELPER_4(VINSWLX, void, env, avr, i64, tl)
  DEF_HELPER_4(VINSDLX, void, env, avr, i64, tl)
-DEF_HELPER_2(vextsb2w, void, avr, avr)
-DEF_HELPER_2(vextsh2w, void, avr, avr)
-DEF_HELPER_2(vextsb2d, void, avr, avr)
-DEF_HELPER_2(vextsh2d, void, avr, avr)
-DEF_HELPER_2(vextsw2d, void, avr, avr)
  DEF_HELPER_2(vnegw, void, avr, avr)
  DEF_HELPER_2(vnegd, void, avr, avr)
  DEF_HELPER_2(vupkhpx, void, avr, avr)
diff --git a/target/ppc/insn32.decode b/target/ppc/insn32.decode
index c4796260b6..757791f0ac 100644
--- a/target/ppc/insn32.decode
+++ b/target/ppc/insn32.decode
@@ -419,6 +419,14 @@ VINSWVRX    000100 ..... ..... ..... 00110001111    @VX
  VSLDBI  000100 ..... ..... ..... 00 ... 010110  @VN
  VSRDBI  000100 ..... ..... ..... 01 ... 010110  @VN
  
+## Vector Integer Arithmetic Instructions

+
+VEXTSB2W    000100 ..... 10000 ..... 11000000010    @VX_tb
+VEXTSH2W    000100 ..... 10001 ..... 11000000010    @VX_tb
+VEXTSB2D    000100 ..... 11000 ..... 11000000010    @VX_tb
+VEXTSH2D    000100 ..... 11001 ..... 11000000010    @VX_tb
+VEXTSW2D    000100 ..... 11010 ..... 11000000010    @VX_tb
+
  ## Vector Mask Manipulation Instructions
  
  MTVSRBM 000100 ..... 10000 ..... 11001000010    @VX_tb

diff --git a/target/ppc/int_helper.c b/target/ppc/int_helper.c
index 79cde68f19..630fbc579a 100644
--- a/target/ppc/int_helper.c
+++ b/target/ppc/int_helper.c
@@ -1768,21 +1768,6 @@ XXBLEND(W, 32)
  XXBLEND(D, 64)
  #undef XXBLEND
  
-#define VEXT_SIGNED(name, element, cast)\

-void helper_##name(ppc_avr_t *r, ppc_avr_t *b)  \
-{   \
-int i;  \
-for (i = 0; i < ARRAY_SIZE(r->element); i++) {  \
-r->element[i] = (cast)b->element[i];\
-}   \
-}
-VEXT_SIGNED(vextsb2w, s32, int8_t)
-VEXT_SIGNED(vextsb2d, s64, int8_t)
-VEXT_SIGNED(vextsh2w, s32, int16_t)
-VEXT_SIGNED(vextsh2d, s64, int16_t)
-VEXT_SIGNED(vextsw2d, s64, int32_t)
-#undef VEXT_SIGNED
-
  #define VNEG(name, element) \
  void helper_##name(ppc_avr_t *r, ppc_avr_t *b)  \
  {   \
diff --git a/target/ppc/translate/vmx-impl.c.inc 
b/target/ppc/translate/vmx-impl.c.inc
index b7559cf94c..ec782c47ff 100644
--- a/target/ppc/translate/vmx-impl.c.inc
+++ b/target/ppc/translate/vmx-impl.c.inc
@@ -1772,11 +1772,26 @@ GEN_VXFORM_TRANS(vclzw, 1, 30)
  GEN_VXFORM_TRANS(vclzd, 1, 31)
  GEN_VXFORM_NOA_2(vnegw, 1, 24, 6)
  GEN_VXFORM_NOA_2(vnegd, 1, 24, 7)
-GEN_VXFORM_NOA_2(vextsb2w, 1, 24, 16)
-GEN_VXFORM_NOA_2(vextsh2w, 1, 24, 17)
-GEN_VXFORM_NOA_2(vextsb2d, 1, 24, 24)
-GEN_VXFORM_NOA_2(vextsh2d, 1, 24, 25)
-GEN_VXFORM_NOA_2(vextsw2d, 1, 24, 26)
+
+static bool do_vexts(DisasContext *ctx, arg_VX_tb *a, int vece, int s)
+{
+REQUIRE_INSNS_FLAGS2(ctx, ISA300);
+REQUIRE_VECTOR(ctx);
+
+tcg_gen_gvec_shli(vece, avr_full_offset(a->vrt), avr_full_offset(a->vrb),
+  s, 16, 16);
+tcg_gen_gvec_sari(vece, avr_full_offset(a->vrt), avr_full_offset(a->vrt),
+  s, 16, 16);


It would be better to collect this into a single composite gvec operation (x86 is 
especially bad with unsupported vector operation sizes).


Use GVecGen3, provide the relevant .fni4/.fni8/.fniv functions and the vecop_list.  We 
have elsewhere relied on 4 integer operations being expanded, when the vector op itself 
isn't supported, so you should be able to drop the .fno out-of-line helper.
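
For VEXTSB2W that could look like (untested; since there is only one
input, the two-operand GVecGen2 form applies):

    static void gen_vextsb2w_i32(TCGv_i32 t, TCGv_i32 b)
    {
        tcg_gen_ext8s_i32(t, b);
    }

    static void gen_vextsb2w_vec(unsigned vece, TCGv_vec t, TCGv_vec b)
    {
        tcg_gen_shli_vec(vece, t, b, 24);
        tcg_gen_sari_vec(vece, t, t, 24);
    }

    static const TCGOpcode vecop_list[] = {
        INDEX_op_shli_vec, INDEX_op_sari_vec, 0
    };
    static const GVecGen2 op = {
        .fni4 = gen_vextsb2w_i32,
        .fniv = gen_vextsb2w_vec,
        .opt_opc = vecop_list,
        .vece = MO_32,
    };

    tcg_gen_gvec_2(avr_full_offset(a->vrt), avr_full_offset(a->vrb),
                   16, 16, &op);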



r~



RE: [PATCH v2 06/12] Hexagon (tests/tcg/hexagon) test instructions that might set bits in USR

2022-02-10 Thread Taylor Simpson


> -Original Message-
> From: Richard Henderson 
> Sent: Thursday, February 10, 2022 7:03 PM
> To: Taylor Simpson ; qemu-devel@nongnu.org
> Cc: f4...@amsat.org; a...@rev.ng; Brian Cain ; Michael
> Lambert 
> Subject: Re: [PATCH v2 06/12] Hexagon (tests/tcg/hexagon) test instructions
> that might set bits in USR
> 
> On 2/10/22 13:15, Taylor Simpson wrote:
> > +#define CLEAR_USRBITS \
> > +"r2 = usr\n\t" \
> > +"r2 = clrbit(r2, #0)\n\t" \
> > +"r2 = clrbit(r2, #1)\n\t" \
> > +"r2 = clrbit(r2, #2)\n\t" \
> > +"r2 = clrbit(r2, #3)\n\t" \
> > +"r2 = clrbit(r2, #4)\n\t" \
> > +"r2 = clrbit(r2, #5)\n\t" \
> > +"usr = r2\n\t"
> 
> It's just a test case, so it doesn't really matter, but
> 
>  r2 = and(r2, #~0x3f)

Our assembler won't parse the ~.  So, I'll have to go with 0xffffffc0.

Taylor



Re: [PATCH v3 06/37] target/ppc: Implement vmsumudm instruction

2022-02-10 Thread Richard Henderson

On 2/10/22 23:34, matheus.fe...@eldorado.org.br wrote:

From: Víctor Colombo

Based on [1] by Lijun Pan, which was never merged
into master.

[1]:https://lists.gnu.org/archive/html/qemu-ppc/2020-07/msg00419.html

Signed-off-by: Víctor Colombo
Signed-off-by: Matheus Ferst
---
  target/ppc/insn32.decode|  1 +
  target/ppc/translate/vmx-impl.c.inc | 34 +
  2 files changed, 35 insertions(+)


Reviewed-by: Richard Henderson 

r~



Re: [PATCH v3 05/37] target/ppc: Implement vmsumcud instruction

2022-02-10 Thread Richard Henderson

On 2/10/22 23:34, matheus.fe...@eldorado.org.br wrote:

+/*
+ * Discard lower 64-bits, leaving the carry into bit 64.
+ * Then sum the higher 64-bit elements.
+ */
+tcg_gen_mov_i64(tmp1, tmp0);
+get_avr64(tmp0, a->rc, true);
+tcg_gen_add2_i64(tmp1, tmp0, tmp0, zero, prod1h, zero);


The move into tmp1 is dead here.
I think you wanted a third add2 here, adding the old tmp0 + new rc word.


r~



Re: [PATCH v3 04/37] target/ppc: vmulh* instructions use gvec

2022-02-10 Thread Richard Henderson

On 2/10/22 23:34, matheus.fe...@eldorado.org.br wrote:

+static void do_vx_vmulhu_vec(unsigned vece, TCGv_vec t, TCGv_vec a, TCGv_vec b)
+{
+TCGv_vec a1, b1, mask, w, k;
+unsigned bits;
+bits = (vece == MO_32) ? 16 : 32;
+
+a1 = tcg_temp_new_vec_matching(t);
+b1 = tcg_temp_new_vec_matching(t);
+w  = tcg_temp_new_vec_matching(t);
+k  = tcg_temp_new_vec_matching(t);
+mask = tcg_temp_new_vec_matching(t);
+
+tcg_gen_dupi_vec(vece, mask, (vece == MO_32) ? 0xffff : 0xffffffff);
+tcg_gen_and_vec(vece, a1, a, mask);
+tcg_gen_and_vec(vece, b1, b, mask);
+tcg_gen_mul_vec(vece, t, a1, b1);
+tcg_gen_shri_vec(vece, k, t, bits);
+
+tcg_gen_shri_vec(vece, a1, a, bits);
+tcg_gen_mul_vec(vece, t, a1, b1);
+tcg_gen_add_vec(vece, t, t, k);
+tcg_gen_and_vec(vece, k, t, mask);
+tcg_gen_shri_vec(vece, w, t, bits);
+
+tcg_gen_and_vec(vece, a1, a, mask);
+tcg_gen_shri_vec(vece, b1, b, bits);
+tcg_gen_mul_vec(vece, t, a1, b1);
+tcg_gen_add_vec(vece, t, t, k);
+tcg_gen_shri_vec(vece, k, t, bits);
+
+tcg_gen_shri_vec(vece, a1, a, bits);
+tcg_gen_mul_vec(vece, t, a1, b1);
+tcg_gen_add_vec(vece, t, t, w);
+tcg_gen_add_vec(vece, t, t, k);


I don't think that you should decompose 4 high-part 32-bit multiplies into 4 32-bit 
multiplies plus lots of arithmetic.  This is not a win.  You're actually better off with 
pure integer arithmetic here.


You could instead widen these into 2 64-bit multiplies, plus some arithmetic.  That's 
certainly closer to the break-even point.
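
E.g. for the 32-bit elements, two high products per 64-bit lane
(untested; this would be the .fni8 expansion):

    static void gen_vmulhuw_i64(TCGv_i64 t, TCGv_i64 a, TCGv_i64 b)
    {
        TCGv_i64 tl = tcg_temp_new_i64();
        TCGv_i64 tb = tcg_temp_new_i64();

        tcg_gen_ext32u_i64(tl, a);
        tcg_gen_ext32u_i64(tb, b);
        tcg_gen_mul_i64(tl, tl, tb);            /* low x low, all 64 bits  */
        tcg_gen_shri_i64(tl, tl, 32);           /* its high half, in place */

        tcg_gen_shri_i64(t, a, 32);
        tcg_gen_shri_i64(tb, b, 32);
        tcg_gen_mul_i64(t, t, tb);              /* high x high */
        tcg_gen_andi_i64(t, t, 0xffffffff00000000ull);

        tcg_gen_or_i64(t, t, tl);

        tcg_temp_free_i64(tl);
        tcg_temp_free_i64(tb);
    }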



+{
+.fniv = do_vx_vmulhu_vec,
+.fno  = gen_helper_VMULHUD,
+.opt_opc = vecop_list,
+.vece = MO_64
+},
+};


As for the two high-part 64-bit multiplies, I think that should definitely remain an 
integer operation.


You probably want to expand these with inline integer operations using .fni[48].


+static void do_vx_vmulhs_vec(unsigned vece, TCGv_vec t, TCGv_vec a, TCGv_vec b)


Very much likewise.


r~



Re: [PATCH v3 03/37] target/ppc: Moved vector multiply high and low to decodetree

2022-02-10 Thread Richard Henderson

On 2/10/22 23:34, matheus.fe...@eldorado.org.br wrote:

From: "Lucas Mateus Castro (alqotel)"

Moved instructions vmulld, vmulhuw, vmulhsw, vmulhud and vmulhsd to
decodetree

Signed-off-by: Lucas Mateus Castro (alqotel)
Signed-off-by: Matheus Ferst
---
  target/ppc/helper.h |  8 
  target/ppc/insn32.decode|  6 ++
  target/ppc/int_helper.c |  8 
  target/ppc/translate/vmx-impl.c.inc | 21 -
  target/ppc/translate/vmx-ops.c.inc  |  5 -
  5 files changed, 30 insertions(+), 18 deletions(-)


Reviewed-by: Richard Henderson 

r~



Re: [PATCH v3 02/37] target/ppc: moved vector even and odd multiplication to decodetree

2022-02-10 Thread Richard Henderson

On 2/10/22 23:34, matheus.fe...@eldorado.org.br wrote:

+void helper_VMULESD(ppc_avr_t *r, ppc_avr_t *a, ppc_avr_t *b)
+{
+muls64(&r->VsrD(1), &r->VsrD(0), a->VsrSD(0), b->VsrSD(0));
+}
+void helper_VMULOSD(ppc_avr_t *r, ppc_avr_t *a, ppc_avr_t *b)
+{
+muls64(&r->VsrD(1), &r->VsrD(0), a->VsrSD(1), b->VsrSD(1));
+}
+void helper_VMULEUD(ppc_avr_t *r, ppc_avr_t *a, ppc_avr_t *b)
+{
+mulu64(&r->VsrD(1), &r->VsrD(0), a->VsrD(0), b->VsrD(0));
+}
+void helper_VMULOUD(ppc_avr_t *r, ppc_avr_t *a, ppc_avr_t *b)
+{
+mulu64(&r->VsrD(1), &r->VsrD(0), a->VsrD(1), b->VsrD(1));
+}


These are single tcg calls; there's no particular need to have them out-of-line, except 
perhaps just to make it easier for your pattern expansion.  But perhaps not important 
enough to worry about.
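
(Inline it would just be, untested:

    static bool trans_VMULEUD(DisasContext *ctx, arg_VX *a)
    {
        TCGv_i64 lo, hi;

        REQUIRE_INSNS_FLAGS2(ctx, ISA310);
        REQUIRE_VECTOR(ctx);

        lo = tcg_temp_new_i64();
        hi = tcg_temp_new_i64();

        get_avr64(lo, a->vra, true);
        get_avr64(hi, a->vrb, true);
        tcg_gen_mulu2_i64(lo, hi, lo, hi);
        set_avr64(a->vrt, lo, false);
        set_avr64(a->vrt, hi, true);
        return true;
    }

and similarly for the odd/signed variants.)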


Reviewed-by: Richard Henderson 


r~



Re: [PATCH] Hexagon (target/hexagon) convert to OBJECT_DECLARE_TYPE

2022-02-10 Thread Richard Henderson

On 2/11/22 14:30, Taylor Simpson wrote:

Suggested-by: Richard Henderson
Signed-off-by: Taylor Simpson
---
  target/hexagon/cpu.h | 9 ++-------
  1 file changed, 2 insertions(+), 7 deletions(-)


Reviewed-by: Richard Henderson 

r~



[PATCH] Hexagon (target/hexagon) convert to OBJECT_DECLARE_TYPE

2022-02-10 Thread Taylor Simpson
Suggested-by: Richard Henderson 
Signed-off-by: Taylor Simpson 
---
 target/hexagon/cpu.h | 9 ++-------
 1 file changed, 2 insertions(+), 7 deletions(-)

diff --git a/target/hexagon/cpu.h b/target/hexagon/cpu.h
index 58a0d3870b..e3efbb2303 100644
--- a/target/hexagon/cpu.h
+++ b/target/hexagon/cpu.h
@@ -1,5 +1,5 @@
 /*
- *  Copyright(c) 2019-2021 Qualcomm Innovation Center, Inc. All Rights 
Reserved.
+ *  Copyright(c) 2019-2022 Qualcomm Innovation Center, Inc. All Rights 
Reserved.
  *
  *  This program is free software; you can redistribute it and/or modify
  *  it under the terms of the GNU General Public License as published by
@@ -131,12 +131,7 @@ struct CPUHexagonState {
 VTCMStoreLog vtcm_log;
 };
 
-#define HEXAGON_CPU_CLASS(klass) \
-OBJECT_CLASS_CHECK(HexagonCPUClass, (klass), TYPE_HEXAGON_CPU)
-#define HEXAGON_CPU(obj) \
-OBJECT_CHECK(HexagonCPU, (obj), TYPE_HEXAGON_CPU)
-#define HEXAGON_CPU_GET_CLASS(obj) \
-OBJECT_GET_CLASS(HexagonCPUClass, (obj), TYPE_HEXAGON_CPU)
+OBJECT_DECLARE_TYPE(HexagonCPU, HexagonCPUClass, HEXAGON_CPU)
 
 typedef struct HexagonCPUClass {
 /*< private >*/
-- 
2.17.1



RE: [PATCH 11/15] target: Use ArchCPU as interface to target CPU

2022-02-10 Thread Taylor Simpson

> -Original Message-
> From: Richard Henderson 
> Sent: Thursday, February 10, 2022 7:22 PM
> To: Taylor Simpson ; Philippe Mathieu-Daudé
> ; qemu-devel@nongnu.org
> Cc: Paolo Bonzini ; Thomas Huth
> 
> Subject: Re: [PATCH 11/15] target: Use ArchCPU as interface to target CPU
> 
> On 2/11/22 04:35, Taylor Simpson wrote:
> > -#define HEXAGON_CPU_CLASS(klass) \
> > -OBJECT_CLASS_CHECK(HexagonCPUClass, (klass),
> TYPE_HEXAGON_CPU)
> > -#define HEXAGON_CPU(obj) \
> > -OBJECT_CHECK(HexagonCPU, (obj), TYPE_HEXAGON_CPU)
> > -#define HEXAGON_CPU_GET_CLASS(obj) \
> > -OBJECT_GET_CLASS(HexagonCPUClass, (obj), TYPE_HEXAGON_CPU)
> > +OBJECT_DECLARE_TYPE(HexagonCPU, HexagonCPUClass,
> HEXAGON_CPU)
> >
> >   typedef struct HexagonCPUClass {
> >   /*< private >*/
> >
> > But it's definitely a smaller change (and matches all of the other targets).
> 
> I do think that the conversion to OBJECT_DECLARE_TYPE should happen first,
> via whichever tree you choose.

OK, I'll send a patch.  Then, submit a pull request along with the other 
changes you just looked at.

Taylor



Re: [PATCH v3] target/riscv: Enable Zicbo[m,z,p] instructions

2022-02-10 Thread Weiwei Li



On 2022/2/11 12:34 AM, Christoph Muellner wrote:

The RISC-V base cache management operation ISA extension has been
ratified [1]. This patch adds support for the defined instructions.

The cmo.prefetch instructions are nops for QEMU (no emulation of the memory
hierarchy, no illegal instructions, no permission faults, no traps),
therefore there's only a comment where they would be decoded.

The other cbo* instructions are moved into an overlap group to
resolve the overlapping pattern with the LQ instruction.
The cbo.zero instruction zeroes a configurable number of bytes.
Similar to other extensions (e.g. atomic instructions),
the trap behavior is limited such that only the page permissions
are checked (ignoring other optional protection mechanisms like
PMA or PMP).

[1] https://wiki.riscv.org/display/TECH/Recently+Ratified+Extensions
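
A cbo.zero helper along the lines described above can be shaped roughly
like this (a sketch, not necessarily the exact code in this patch):

    void helper_cbo_zero(CPURISCVState *env, target_ulong address)
    {
        RISCVCPU *cpu = env_archcpu(env);
        uint16_t cbozlen = cpu->cfg.cbozlen;
        int mmu_idx = cpu_mmu_index(env, false);
        uintptr_t ra = GETPC();
        void *mem;

        /* Mask off low-order bits to align-down to the block size. */
        address &= ~(target_ulong)(cbozlen - 1);

        /* Check page permissions only, as described above. */
        mem = probe_write(env, address, cbozlen, mmu_idx, ra);
        if (mem) {
            memset(mem, 0, cbozlen);
        } else {
            /* Slow path for MMIO or watched pages. */
            for (int i = 0; i < cbozlen; i++) {
                cpu_stb_data_ra(env, address + i, 0, ra);
            }
        }
    }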

v3:
- Enable by default (like zb*)
- Rename flags Zicbo* -> zicbo* (like zb*)
- Rename ext_zicbo* -> ext_icbo* (like ext_icsr)
- Rename trans_zicbo.c.inc -> trans_rvzicbo.c.inc (like all others)
- Simplify prefetch instruction support to a single comment
- Rebase on top of github-alistair23/riscv-to-apply.next plus the
   Priv v1.12 series from github-atishp04/priv_1_12_support_v3

v2:
- Fix overlapping instruction encoding with LQ instructions
- Drop CSR related changes and rebase on Priv 1.12 patchset

Co-developed-by: Philipp Tomsich 
Signed-off-by: Christoph Muellner 
---
  target/riscv/cpu.c  |  3 +
  target/riscv/cpu.h  |  3 +
  target/riscv/helper.h   |  5 ++
  target/riscv/insn32.decode  | 16 +++-
  target/riscv/insn_trans/trans_rvzicbo.c.inc | 57 +
  target/riscv/op_helper.c| 94 +
  target/riscv/translate.c|  1 +
  7 files changed, 178 insertions(+), 1 deletion(-)
  create mode 100644 target/riscv/insn_trans/trans_rvzicbo.c.inc

diff --git a/target/riscv/cpu.c b/target/riscv/cpu.c
index 39ffb883fc..cbd0a34318 100644
--- a/target/riscv/cpu.c
+++ b/target/riscv/cpu.c
@@ -764,6 +764,9 @@ static Property riscv_cpu_properties[] = {
  DEFINE_PROP_BOOL("Counters", RISCVCPU, cfg.ext_counters, true),
  DEFINE_PROP_BOOL("Zifencei", RISCVCPU, cfg.ext_ifencei, true),
  DEFINE_PROP_BOOL("Zicsr", RISCVCPU, cfg.ext_icsr, true),
+DEFINE_PROP_BOOL("zicbom", RISCVCPU, cfg.ext_icbom, true),
+DEFINE_PROP_BOOL("zicboz", RISCVCPU, cfg.ext_icboz, true),
+DEFINE_PROP_UINT16("cbozlen", RISCVCPU, cfg.cbozlen, 64),
  DEFINE_PROP_BOOL("Zfh", RISCVCPU, cfg.ext_zfh, false),
  DEFINE_PROP_BOOL("Zfhmin", RISCVCPU, cfg.ext_zfhmin, false),
  DEFINE_PROP_BOOL("Zve32f", RISCVCPU, cfg.ext_zve32f, false),
diff --git a/target/riscv/cpu.h b/target/riscv/cpu.h
index fe80caeec0..7bd2fd26d6 100644
--- a/target/riscv/cpu.h
+++ b/target/riscv/cpu.h
@@ -368,6 +368,8 @@ struct RISCVCPUConfig {
  bool ext_counters;
  bool ext_ifencei;
  bool ext_icsr;
+bool ext_icbom;
+bool ext_icboz;
  bool ext_zfh;
  bool ext_zfhmin;
  bool ext_zve32f;
@@ -382,6 +384,7 @@ struct RISCVCPUConfig {
  char *vext_spec;
  uint16_t vlen;
  uint16_t elen;
+uint16_t cbozlen;
  bool mmu;
  bool pmp;
  bool epmp;
diff --git a/target/riscv/helper.h b/target/riscv/helper.h
index 72cc2582f4..ef1944da8f 100644
--- a/target/riscv/helper.h
+++ b/target/riscv/helper.h
@@ -92,6 +92,11 @@ DEF_HELPER_FLAGS_2(fcvt_h_l, TCG_CALL_NO_RWG, i64, env, tl)
  DEF_HELPER_FLAGS_2(fcvt_h_lu, TCG_CALL_NO_RWG, i64, env, tl)
  DEF_HELPER_FLAGS_1(fclass_h, TCG_CALL_NO_RWG_SE, tl, i64)
  
+/* Cache-block operations */
+DEF_HELPER_2(cbo_clean_flush, void, env, tl)
+DEF_HELPER_2(cbo_inval, void, env, tl)
+DEF_HELPER_2(cbo_zero, void, env, tl)
+
  /* Special functions */
  DEF_HELPER_2(csrr, tl, env, int)
  DEF_HELPER_3(csrw, void, env, int, tl)
diff --git a/target/riscv/insn32.decode b/target/riscv/insn32.decode
index 5bbedc254c..d5f8329970 100644
--- a/target/riscv/insn32.decode
+++ b/target/riscv/insn32.decode
@@ -128,6 +128,7 @@ addi     ............ ..... 000 ..... 0010011 @i
 slti     ............ ..... 010 ..... 0010011 @i
 sltiu    ............ ..... 011 ..... 0010011 @i
 xori     ............ ..... 100 ..... 0010011 @i
+# cbo.prefetch_{i,r,m} instructions are ori with rd=x0 and not decoded.
 ori      ............ ..... 110 ..... 0010011 @i
 andi     ............ ..... 111 ..... 0010011 @i
 slli     00000. ...... ..... 001 ..... 0010011 @sh
@@ -168,7 +169,20 @@ sraw     0100000 ..... ..... 101 ..... 0111011 @r
 
 # *** RV128I Base Instruction Set (in addition to RV64I) ***
 ldu      ............ ..... 111 ..... 0000011 @i
-lq       ............ ..... 010 ..... 0001111 @i
+{
+  [
+# *** RV32 Zicbom Standard Extension ***
+cbo_clean  0000000 00001 ..... 010 00000 0001111 @sfence_vm
+cbo_flush  0000000 00010 ..... 010 00000 0001111 @sfence_vm
+cbo_inval  0000000 00000 ..... 010 00000 0001111 @sfence_vm

[PULL 34/34] tests/tcg/multiarch: Add sigbus.c

2022-02-10 Thread Richard Henderson
A mostly generic test for unaligned access raising SIGBUS.
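
For the record, the expected values asserted at the end of the test fall
out of a little byte arithmetic (a worked example, not part of the patch):

    /* x = 0x8877665544332211ull and p = (void *)&x + 1.
     * Little-endian memory holds 11 22 33 44 55 66 77 88, so the four
     * bytes at p are 22 33 44 55 and the loaded int is 0x55443322.
     * Big-endian memory holds 88 77 66 55 44 33 22 11, so the four
     * bytes at p are 77 66 55 44 and the loaded int is 0x77665544.
     */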

Reviewed-by: Peter Maydell 
Signed-off-by: Richard Henderson 
---
 tests/tcg/multiarch/sigbus.c | 68 
 1 file changed, 68 insertions(+)
 create mode 100644 tests/tcg/multiarch/sigbus.c

diff --git a/tests/tcg/multiarch/sigbus.c b/tests/tcg/multiarch/sigbus.c
new file mode 100644
index 0000000000..8134c5fd56
--- /dev/null
+++ b/tests/tcg/multiarch/sigbus.c
@@ -0,0 +1,68 @@
+#define _GNU_SOURCE 1
+
+#include <assert.h>
+#include <stdlib.h>
+#include <signal.h>
+#include <endian.h>
+
+
+unsigned long long x = 0x8877665544332211ull;
+void * volatile p = (void *)&x + 1;
+
+void sigbus(int sig, siginfo_t *info, void *uc)
+{
+assert(sig == SIGBUS);
+assert(info->si_signo == SIGBUS);
+#ifdef BUS_ADRALN
+assert(info->si_code == BUS_ADRALN);
+#endif
+assert(info->si_addr == p);
+exit(EXIT_SUCCESS);
+}
+
+int main()
+{
+struct sigaction sa = {
+.sa_sigaction = sigbus,
+.sa_flags = SA_SIGINFO
+};
+int allow_fail = 0;
+int tmp;
+
+tmp = sigaction(SIGBUS, &sa, NULL);
+assert(tmp == 0);
+
+/*
+ * Select an operation that's likely to enforce alignment.
+ * On many guests that support unaligned accesses by default,
+ * this is often an atomic operation.
+ */
+#if defined(__aarch64__)
+asm volatile("ldxr %w0,[%1]" : "=r"(tmp) : "r"(p) : "memory");
+#elif defined(__alpha__)
+asm volatile("ldl_l %0,0(%1)" : "=r"(tmp) : "r"(p) : "memory");
+#elif defined(__arm__)
+asm volatile("ldrex %0,[%1]" : "=r"(tmp) : "r"(p) : "memory");
+#elif defined(__powerpc__)
+asm volatile("lwarx %0,0,%1" : "=r"(tmp) : "r"(p) : "memory");
+#elif defined(__riscv_atomic)
+asm volatile("lr.w %0,(%1)" : "=r"(tmp) : "r"(p) : "memory");
+#else
+/* No insn known to fault unaligned -- try for a straight load. */
+allow_fail = 1;
+tmp = *(volatile int *)p;
+#endif
+
+assert(allow_fail);
+
+/*
+ * We didn't see a signal.
+ * We might as well validate the unaligned load worked.
+ */
+if (BYTE_ORDER == LITTLE_ENDIAN) {
+assert(tmp == 0x55443322);
+} else {
+assert(tmp == 0x77665544);
+}
+return EXIT_SUCCESS;
+}
-- 
2.25.1




[PULL 32/34] tcg/sparc: Add tcg_out_jmpl_const for better tail calls

2022-02-10 Thread Richard Henderson
Due to mapping changes, we now rarely place the code_gen_buffer
near the main executable, which means that direct calls will now
rarely be in range.

So, always use indirect calls for tail calls, which allows us to
avoid clobbering %o7, and therefore we need not save and restore it.
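
For anyone not steeped in SPARC: JMPL jumps to a computed address and
writes the return address into its destination register, so the same
instruction serves as either call or jump depending on rd.  That is the
whole trick in the hunk below:

    /* rd = %o7 links (a call); rd = %g0 discards the link (a tail
     * call), leaving %o7 -- our caller's return address -- intact. */
    tcg_out_arithi(s, tail_call ? TCG_REG_G0 : TCG_REG_O7,
                   TCG_REG_T1, desti & 0xfff, JMPL);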

Reviewed-by: Peter Maydell 
Signed-off-by: Richard Henderson 
---
 tcg/sparc/tcg-target.c.inc | 37 +++--
 1 file changed, 23 insertions(+), 14 deletions(-)

diff --git a/tcg/sparc/tcg-target.c.inc b/tcg/sparc/tcg-target.c.inc
index e78945d153..646bb462c3 100644
--- a/tcg/sparc/tcg-target.c.inc
+++ b/tcg/sparc/tcg-target.c.inc
@@ -858,6 +858,19 @@ static void tcg_out_addsub2_i64(TCGContext *s, TCGReg rl, 
TCGReg rh,
 tcg_out_mov(s, TCG_TYPE_I64, rl, tmp);
 }
 
+static void tcg_out_jmpl_const(TCGContext *s, const tcg_insn_unit *dest,
+   bool in_prologue, bool tail_call)
+{
+uintptr_t desti = (uintptr_t)dest;
+
+/* Be careful not to clobber %o7 for a tail call. */
+tcg_out_movi_int(s, TCG_TYPE_PTR, TCG_REG_T1,
+ desti & ~0xfff, in_prologue,
+ tail_call ? TCG_REG_G2 : TCG_REG_O7);
+tcg_out_arithi(s, tail_call ? TCG_REG_G0 : TCG_REG_O7,
+   TCG_REG_T1, desti & 0xfff, JMPL);
+}
+
 static void tcg_out_call_nodelay(TCGContext *s, const tcg_insn_unit *dest,
  bool in_prologue)
 {
@@ -866,10 +879,7 @@ static void tcg_out_call_nodelay(TCGContext *s, const 
tcg_insn_unit *dest,
 if (disp == (int32_t)disp) {
 tcg_out32(s, CALL | (uint32_t)disp >> 2);
 } else {
-uintptr_t desti = (uintptr_t)dest;
-tcg_out_movi_int(s, TCG_TYPE_PTR, TCG_REG_T1,
- desti & ~0xfff, in_prologue, TCG_REG_O7);
-tcg_out_arithi(s, TCG_REG_O7, TCG_REG_T1, desti & 0xfff, JMPL);
+tcg_out_jmpl_const(s, dest, in_prologue, false);
 }
 }
 
@@ -960,11 +970,10 @@ static void build_trampolines(TCGContext *s)
 
 /* Set the retaddr operand.  */
 tcg_out_mov(s, TCG_TYPE_PTR, ra, TCG_REG_O7);
-/* Set the env operand.  */
-tcg_out_mov(s, TCG_TYPE_PTR, TCG_REG_O0, TCG_AREG0);
 /* Tail call.  */
-tcg_out_call_nodelay(s, qemu_ld_helpers[i], true);
-tcg_out_mov(s, TCG_TYPE_PTR, TCG_REG_O7, ra);
+tcg_out_jmpl_const(s, qemu_ld_helpers[i], true, true);
+/* delay slot -- set the env argument */
+tcg_out_mov_delay(s, TCG_REG_O0, TCG_AREG0);
 }
 
 for (i = 0; i < ARRAY_SIZE(qemu_st_helpers); ++i) {
@@ -1006,14 +1015,14 @@ static void build_trampolines(TCGContext *s)
 if (ra >= TCG_REG_O6) {
 tcg_out_st(s, TCG_TYPE_PTR, TCG_REG_O7, TCG_REG_CALL_STACK,
TCG_TARGET_CALL_STACK_OFFSET);
-ra = TCG_REG_G1;
+} else {
+tcg_out_mov(s, TCG_TYPE_PTR, ra, TCG_REG_O7);
 }
-tcg_out_mov(s, TCG_TYPE_PTR, ra, TCG_REG_O7);
-/* Set the env operand.  */
-tcg_out_mov(s, TCG_TYPE_PTR, TCG_REG_O0, TCG_AREG0);
+
 /* Tail call.  */
-tcg_out_call_nodelay(s, qemu_st_helpers[i], true);
-tcg_out_mov(s, TCG_TYPE_PTR, TCG_REG_O7, ra);
+tcg_out_jmpl_const(s, qemu_st_helpers[i], true, true);
+/* delay slot -- set the env argument */
+tcg_out_mov_delay(s, TCG_REG_O0, TCG_AREG0);
 }
 }
 #endif
-- 
2.25.1




[PULL 26/34] tcg/sparc: Use tcg_out_movi_imm13 in tcg_out_addsub2_i64

2022-02-10 Thread Richard Henderson
When BH is constant, it is constrained to 11 bits for use in MOVCC.
For the cases in which we must load the constant BH into a register,
we do not need the full logic of tcg_out_movi; we can use the simpler
function for emitting a 13 bit constant.

This eliminates the only case in which TCG_REG_T2 was passed to
tcg_out_movi, which will shortly become invalid.
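
To make the range argument concrete (a quick sanity check, not part of
the patch):

    /* An 11-bit signed MOVCC immediate lies in [-1024, 1023]; after the
     * +/-1 carry adjustment it lies in [-1025, 1024], comfortably inside
     * the 13-bit signed range [-4096, 4095] that movi_imm13 can emit. */
    _Static_assert(-1024 - 1 >= -4096 && 1023 + 1 <= 4095,
                   "adjusted MOVCC immediate fits in simm13");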

Reviewed-by: Peter Maydell 
Signed-off-by: Richard Henderson 
---
 tcg/sparc/tcg-target.c.inc | 10 +++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/tcg/sparc/tcg-target.c.inc b/tcg/sparc/tcg-target.c.inc
index 0c062c60eb..8d5992ef29 100644
--- a/tcg/sparc/tcg-target.c.inc
+++ b/tcg/sparc/tcg-target.c.inc
@@ -795,7 +795,7 @@ static void tcg_out_addsub2_i64(TCGContext *s, TCGReg rl, 
TCGReg rh,
 if (use_vis3_instructions && !is_sub) {
 /* Note that ADDXC doesn't accept immediates.  */
 if (bhconst && bh != 0) {
-   tcg_out_movi(s, TCG_TYPE_I64, TCG_REG_T2, bh);
+   tcg_out_movi_imm13(s, TCG_REG_T2, bh);
bh = TCG_REG_T2;
 }
 tcg_out_arith(s, rh, ah, bh, ARITH_ADDXC);
@@ -811,9 +811,13 @@ static void tcg_out_addsub2_i64(TCGContext *s, TCGReg rl, 
TCGReg rh,
tcg_out_movcc(s, TCG_COND_GEU, MOVCC_XCC, rh, ah, 0);
}
 } else {
-/* Otherwise adjust BH as if there is carry into T2 ... */
+/*
+ * Otherwise adjust BH as if there is carry into T2.
+ * Note that constant BH is constrained to 11 bits for the MOVCC,
+ * so the adjustment fits 12 bits.
+ */
 if (bhconst) {
-tcg_out_movi(s, TCG_TYPE_I64, TCG_REG_T2, bh + (is_sub ? -1 : 1));
+tcg_out_movi_imm13(s, TCG_REG_T2, bh + (is_sub ? -1 : 1));
 } else {
 tcg_out_arithi(s, TCG_REG_T2, bh, 1,
is_sub ? ARITH_SUB : ARITH_ADD);
-- 
2.25.1




[PULL 30/34] tcg/sparc: Convert patch_reloc to return bool

2022-02-10 Thread Richard Henderson
Since 7ecd02a06f8, if patch_reloc fails we restart translation
with a smaller TB.  SPARC had its function signature changed,
but not the logic.  Replace assert with return false.

Reviewed-by: Philippe Mathieu-Daudé 
Reviewed-by: Peter Maydell 
Signed-off-by: Richard Henderson 
---
 tcg/sparc/tcg-target.c.inc | 8 ++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/tcg/sparc/tcg-target.c.inc b/tcg/sparc/tcg-target.c.inc
index ed2f4ecc40..213aba4be6 100644
--- a/tcg/sparc/tcg-target.c.inc
+++ b/tcg/sparc/tcg-target.c.inc
@@ -323,12 +323,16 @@ static bool patch_reloc(tcg_insn_unit *src_rw, int type,
 
 switch (type) {
 case R_SPARC_WDISP16:
-assert(check_fit_ptr(pcrel >> 2, 16));
+if (!check_fit_ptr(pcrel >> 2, 16)) {
+return false;
+}
 insn &= ~INSN_OFF16(-1);
 insn |= INSN_OFF16(pcrel);
 break;
 case R_SPARC_WDISP19:
-assert(check_fit_ptr(pcrel >> 2, 19));
+if (!check_fit_ptr(pcrel >> 2, 19)) {
+return false;
+}
 insn &= ~INSN_OFF19(-1);
 insn |= INSN_OFF19(pcrel);
 break;
-- 
2.25.1




[PULL 28/34] tcg/sparc: Add scratch argument to tcg_out_movi_int

2022-02-10 Thread Richard Henderson
This will allow us to control exactly what scratch register is
used for loading the constant.

Reviewed-by: Peter Maydell 
Signed-off-by: Richard Henderson 
---
 tcg/sparc/tcg-target.c.inc | 15 +--
 1 file changed, 9 insertions(+), 6 deletions(-)

diff --git a/tcg/sparc/tcg-target.c.inc b/tcg/sparc/tcg-target.c.inc
index 2f7c8dcb0a..7a8f20ee9a 100644
--- a/tcg/sparc/tcg-target.c.inc
+++ b/tcg/sparc/tcg-target.c.inc
@@ -428,7 +428,8 @@ static void tcg_out_movi_imm32(TCGContext *s, TCGReg ret, 
int32_t arg)
 }
 
 static void tcg_out_movi_int(TCGContext *s, TCGType type, TCGReg ret,
- tcg_target_long arg, bool in_prologue)
+ tcg_target_long arg, bool in_prologue,
+ TCGReg scratch)
 {
 tcg_target_long hi, lo = (int32_t)arg;
 tcg_target_long test, lsb;
@@ -483,16 +484,17 @@ static void tcg_out_movi_int(TCGContext *s, TCGType type, 
TCGReg ret,
 } else {
 hi = arg >> 32;
 tcg_out_movi_imm32(s, ret, hi);
-tcg_out_movi_imm32(s, TCG_REG_T2, lo);
+tcg_out_movi_imm32(s, scratch, lo);
 tcg_out_arithi(s, ret, ret, 32, SHIFT_SLLX);
-tcg_out_arith(s, ret, ret, TCG_REG_T2, ARITH_OR);
+tcg_out_arith(s, ret, ret, scratch, ARITH_OR);
 }
 }
 
 static void tcg_out_movi(TCGContext *s, TCGType type,
  TCGReg ret, tcg_target_long arg)
 {
-tcg_out_movi_int(s, type, ret, arg, false);
+tcg_debug_assert(ret != TCG_REG_T2);
+tcg_out_movi_int(s, type, ret, arg, false, TCG_REG_T2);
 }
 
 static void tcg_out_ldst_rr(TCGContext *s, TCGReg data, TCGReg a1,
@@ -847,7 +849,7 @@ static void tcg_out_call_nodelay(TCGContext *s, const 
tcg_insn_unit *dest,
 } else {
 uintptr_t desti = (uintptr_t)dest;
 tcg_out_movi_int(s, TCG_TYPE_PTR, TCG_REG_T1,
- desti & ~0xfff, in_prologue);
+ desti & ~0xfff, in_prologue, TCG_REG_O7);
 tcg_out_arithi(s, TCG_REG_O7, TCG_REG_T1, desti & 0xfff, JMPL);
 }
 }
@@ -1023,7 +1025,8 @@ static void tcg_target_qemu_prologue(TCGContext *s)
 
 #ifndef CONFIG_SOFTMMU
 if (guest_base != 0) {
-tcg_out_movi_int(s, TCG_TYPE_PTR, TCG_GUEST_BASE_REG, guest_base, 
true);
+tcg_out_movi_int(s, TCG_TYPE_PTR, TCG_GUEST_BASE_REG,
+ guest_base, true, TCG_REG_T1);
 tcg_regset_set_reg(s->reserved_regs, TCG_GUEST_BASE_REG);
 }
 #endif
-- 
2.25.1




Re: [RFC PATCH v5 00/30] Add LoongArch softmmu support

2022-02-10 Thread yangxiaojuan
Hi, Mark

On 02/05/2022 09:32 PM, Mark Cave-Ayland wrote:
> On 28/01/2022 03:40, Xiaojuan Yang wrote:
> 
>> This series patch add softmmu support for LoongArch.
>> The latest kernel:
>>* https://github.com/loongson/linux/tree/loongarch-next
>> The latest uefi:
>>* https://github.com/loongson/edk2
>>* https://github.com/loongson/edk2-platforms
>> The manual:
>>* 
>> https://github.com/loongson/LoongArch-Documentation/releases/tag/2021.10.11
>>
>> You can get LoongArch qemu series like this:
>> git clone https://github.com/loongson/qemu.git
>> git checkout tcg-dev
>>
>> Changes for v5:
>>
>> 1. Fix host bridge map irq function.
>> 2. Move cpu timer init function into machine init.
>> 3. Adjust memory region layout.
>> 4. Add the documentation at docs/system/loongarch/loongson3.rst.
>> - Introduction to 3a5000 virt.
>> - Output of "info mtree".
>>
>> Changes for v4:
>> 1. UEFI code is now open, and some fdt interfaces were added to pass info 
>> between QEMU and UEFI.
>> 2. Use a per cpu address space for iocsr.
>> 3. Modify the tlb emulation.
>> 4. Machine and board code mainly follow Mark's advice.
>> 5. Adjust pci host space map.
>> 6. Use more memory regions to simplify the interrupt controller emulation.
>>
>>
>> Changes for v3:
>> 1. Target code mainly follows Richard's code review comments.
>> 2. Put the csr and iocsr read/write instruction emulation into 2 different 
>> patches.
>> 3. Simplify the tlb emulation.
>> 4. Delete some unused csr register definitions.
>> 5. Machine and board code mainly follows Mark's advice, discarding the 
>> obsolete interface.
>> 6. NUMA support is removed since it is not complete.
>> 7. Adjust some formatting and naming problems.
>>
>>
>> Changes for v2:
>> 1. Combine patches 2 and 3 into one.
>> 2. Adjust the order of the patches.
>> 3. Put all the binaries on GitHub.
>> 4. Fix some emulation errors seen when using the kernel from GitHub.
>> 5. Adjust some formatting and naming problems.
>> 6. Others mainly follow Richard's code review comments.
>>
>> Please help review!
>>
>> Thanks
>>
>>
>> Xiaojuan Yang (30):
>>target/loongarch: Add system emulation introduction
>>target/loongarch: Add CSRs definition
>>target/loongarch: Add basic vmstate description of CPU.
>>target/loongarch: Implement qmp_query_cpu_definitions()
>>target/loongarch: Add constant timer support
>>target/loongarch: Add MMU support for LoongArch CPU.
>>target/loongarch: Add LoongArch CSR instruction
>>target/loongarch: Add LoongArch IOCSR instruction
>>target/loongarch: Add TLB instruction support
>>target/loongarch: Add other core instructions support
>>target/loongarch: Add LoongArch interrupt and exception handle
>>target/loongarch: Add timer related instructions support.
>>target/loongarch: Add gdb support.
>>hw/pci-host: Add ls7a1000 PCIe Host bridge support for Loongson3
>>  Platform
>>hw/loongarch: Add support loongson3-ls7a machine type.
>>hw/loongarch: Add LoongArch cpu interrupt support(CPUINTC)
>>hw/loongarch: Add LoongArch ipi interrupt support(IPI)
>>hw/intc: Add LoongArch ls7a interrupt controller support(PCH-PIC)
>>hw/intc: Add LoongArch ls7a msi interrupt controller support(PCH-MSI)
>>hw/intc: Add LoongArch extioi interrupt controller(EIOINTC)
>>hw/loongarch: Add irq hierarchy for the system
>>Enable common virtio pci support for LoongArch
>>hw/loongarch: Add some devices support for 3A5000.
>>hw/loongarch: Add LoongArch ls7a rtc device support
>>hw/loongarch: Add default bios startup support.
>>hw/loongarch: Add -kernel and -initrd options support
>>hw/loongarch: Add LoongArch smbios support
>>hw/loongarch: Add LoongArch acpi support
>>hw/loongarch: Add fdt support.
>>tests/tcg/loongarch64: Add hello/memory test in loongarch64 system
>>
>>   .../devices/loongarch64-softmmu/default.mak   |   3 +
>>   configs/targets/loongarch64-softmmu.mak   |   4 +
>>   docs/system/loongarch/loongson3.rst   |  78 ++
>>   gdb-xml/loongarch-base64.xml  |  43 +
>>   gdb-xml/loongarch-fpu64.xml   |  57 ++
>>   hw/Kconfig|   1 +
>>   hw/acpi/Kconfig   |   4 +
>>   hw/acpi/ls7a.c| 374 +
>>   hw/acpi/meson.build   |   1 +
>>   hw/intc/Kconfig   |  15 +
>>   hw/intc/loongarch_extioi.c| 409 ++
>>   hw/intc/loongarch_

[PULL 17/34] tcg/arm: Drop support for armv4 and armv5 hosts

2022-02-10 Thread Richard Henderson
Support for unaligned accesses is difficult for pre-v6 hosts.
While Debian still builds for armv4, we cannot use a compile-time
test, so test the architecture at runtime and error out.

Reviewed-by: Peter Maydell 
Signed-off-by: Richard Henderson 
---
 tcg/arm/tcg-target.c.inc | 5 +
 1 file changed, 5 insertions(+)

diff --git a/tcg/arm/tcg-target.c.inc b/tcg/arm/tcg-target.c.inc
index 5345c4e39c..29d63e98a8 100644
--- a/tcg/arm/tcg-target.c.inc
+++ b/tcg/arm/tcg-target.c.inc
@@ -2474,6 +2474,11 @@ static void tcg_target_init(TCGContext *s)
 if (pl != NULL && pl[0] == 'v' && pl[1] >= '4' && pl[1] <= '9') {
 arm_arch = pl[1] - '0';
 }
+
+if (arm_arch < 6) {
+error_report("TCG: ARMv%d is unsupported; exiting", arm_arch);
+exit(EXIT_FAILURE);
+}
 }
 
 tcg_target_available_regs[TCG_TYPE_I32] = ALL_GENERAL_REGS;
-- 
2.25.1




[PULL 27/34] tcg/sparc: Split out tcg_out_movi_imm32

2022-02-10 Thread Richard Henderson
Handle 32-bit constants with a separate function, so that
tcg_out_movi_int does not need to recurse.  This slightly
rearranges the order of tests for small constants, but
produces the same output.

Reviewed-by: Peter Maydell 
Signed-off-by: Richard Henderson 
---
 tcg/sparc/tcg-target.c.inc | 36 +---
 1 file changed, 21 insertions(+), 15 deletions(-)

diff --git a/tcg/sparc/tcg-target.c.inc b/tcg/sparc/tcg-target.c.inc
index 8d5992ef29..2f7c8dcb0a 100644
--- a/tcg/sparc/tcg-target.c.inc
+++ b/tcg/sparc/tcg-target.c.inc
@@ -413,15 +413,30 @@ static void tcg_out_movi_imm13(TCGContext *s, TCGReg ret, 
int32_t arg)
 tcg_out_arithi(s, ret, TCG_REG_G0, arg, ARITH_OR);
 }
 
+static void tcg_out_movi_imm32(TCGContext *s, TCGReg ret, int32_t arg)
+{
+if (check_fit_i32(arg, 13)) {
+/* A 13-bit constant sign-extended to 64-bits.  */
+tcg_out_movi_imm13(s, ret, arg);
+} else {
+/* A 32-bit constant zero-extended to 64 bits.  */
+tcg_out_sethi(s, ret, arg);
+if (arg & 0x3ff) {
+tcg_out_arithi(s, ret, ret, arg & 0x3ff, ARITH_OR);
+}
+}
+}
+
 static void tcg_out_movi_int(TCGContext *s, TCGType type, TCGReg ret,
  tcg_target_long arg, bool in_prologue)
 {
 tcg_target_long hi, lo = (int32_t)arg;
 tcg_target_long test, lsb;
 
-/* Make sure we test 32-bit constants for imm13 properly.  */
-if (type == TCG_TYPE_I32) {
-arg = lo;
+/* A 32-bit constant, or 32-bit zero-extended to 64-bits.  */
+if (type == TCG_TYPE_I32 || arg == (uint32_t)arg) {
+tcg_out_movi_imm32(s, ret, arg);
+return;
 }
 
 /* A 13-bit constant sign-extended to 64-bits.  */
@@ -439,15 +454,6 @@ static void tcg_out_movi_int(TCGContext *s, TCGType type, 
TCGReg ret,
 }
 }
 
-/* A 32-bit constant, or 32-bit zero-extended to 64-bits.  */
-if (type == TCG_TYPE_I32 || arg == (uint32_t)arg) {
-tcg_out_sethi(s, ret, arg);
-if (arg & 0x3ff) {
-tcg_out_arithi(s, ret, ret, arg & 0x3ff, ARITH_OR);
-}
-return;
-}
-
 /* A 32-bit constant sign-extended to 64-bits.  */
 if (arg == lo) {
 tcg_out_sethi(s, ret, ~arg);
@@ -471,13 +477,13 @@ static void tcg_out_movi_int(TCGContext *s, TCGType type, 
TCGReg ret,
 /* A 64-bit constant decomposed into 2 32-bit pieces.  */
 if (check_fit_i32(lo, 13)) {
 hi = (arg - lo) >> 32;
-tcg_out_movi(s, TCG_TYPE_I32, ret, hi);
+tcg_out_movi_imm32(s, ret, hi);
 tcg_out_arithi(s, ret, ret, 32, SHIFT_SLLX);
 tcg_out_arithi(s, ret, ret, lo, ARITH_ADD);
 } else {
 hi = arg >> 32;
-tcg_out_movi(s, TCG_TYPE_I32, ret, hi);
-tcg_out_movi(s, TCG_TYPE_I32, TCG_REG_T2, lo);
+tcg_out_movi_imm32(s, ret, hi);
+tcg_out_movi_imm32(s, TCG_REG_T2, lo);
 tcg_out_arithi(s, ret, ret, 32, SHIFT_SLLX);
 tcg_out_arith(s, ret, ret, TCG_REG_T2, ARITH_OR);
 }
-- 
2.25.1




[PULL 25/34] tcg/mips: Support unaligned access for softmmu

2022-02-10 Thread Richard Henderson
We can use the routines just added for user-only to emit
unaligned accesses in softmmu mode too.

Tested-by: Jiaxun Yang 
Reviewed-by: Jiaxun Yang 
Reviewed-by: Philippe Mathieu-Daudé 
Signed-off-by: Richard Henderson 
---
 tcg/mips/tcg-target.c.inc | 91 ++-
 1 file changed, 51 insertions(+), 40 deletions(-)

diff --git a/tcg/mips/tcg-target.c.inc b/tcg/mips/tcg-target.c.inc
index 2c94ac2ed6..993149d18a 100644
--- a/tcg/mips/tcg-target.c.inc
+++ b/tcg/mips/tcg-target.c.inc
@@ -1134,8 +1134,10 @@ static void tcg_out_tlb_load(TCGContext *s, TCGReg base, 
TCGReg addrl,
  tcg_insn_unit *label_ptr[2], bool is_load)
 {
 MemOp opc = get_memop(oi);
-unsigned s_bits = opc & MO_SIZE;
 unsigned a_bits = get_alignment_bits(opc);
+unsigned s_bits = opc & MO_SIZE;
+unsigned a_mask = (1 << a_bits) - 1;
+unsigned s_mask = (1 << s_bits) - 1;
 int mem_index = get_mmuidx(oi);
 int fast_off = TLB_MASK_TABLE_OFS(mem_index);
 int mask_off = fast_off + offsetof(CPUTLBDescFast, mask);
@@ -1143,7 +1145,7 @@ static void tcg_out_tlb_load(TCGContext *s, TCGReg base, 
TCGReg addrl,
 int add_off = offsetof(CPUTLBEntry, addend);
 int cmp_off = (is_load ? offsetof(CPUTLBEntry, addr_read)
: offsetof(CPUTLBEntry, addr_write));
-target_ulong mask;
+target_ulong tlb_mask;
 
 /* Load tlb_mask[mmu_idx] and tlb_table[mmu_idx].  */
 tcg_out_ld(s, TCG_TYPE_PTR, TCG_TMP0, TCG_AREG0, mask_off);
@@ -1157,27 +1159,13 @@ static void tcg_out_tlb_load(TCGContext *s, TCGReg 
base, TCGReg addrl,
 /* Add the tlb_table pointer, creating the CPUTLBEntry address in TMP3.  */
 tcg_out_opc_reg(s, ALIAS_PADD, TCG_TMP3, TCG_TMP3, TCG_TMP1);
 
-/* We don't currently support unaligned accesses.
-   We could do so with mips32r6.  */
-if (a_bits < s_bits) {
-a_bits = s_bits;
-}
-
-/* Mask the page bits, keeping the alignment bits to compare against.  */
-mask = (target_ulong)TARGET_PAGE_MASK | ((1 << a_bits) - 1);
-
 /* Load the (low-half) tlb comparator.  */
 if (TCG_TARGET_REG_BITS < TARGET_LONG_BITS) {
-tcg_out_ld(s, TCG_TYPE_I32, TCG_TMP0, TCG_TMP3, cmp_off + LO_OFF);
-tcg_out_movi(s, TCG_TYPE_I32, TCG_TMP1, mask);
+tcg_out_ldst(s, OPC_LW, TCG_TMP0, TCG_TMP3, cmp_off + LO_OFF);
 } else {
 tcg_out_ldst(s, (TARGET_LONG_BITS == 64 ? OPC_LD
  : TCG_TARGET_REG_BITS == 64 ? OPC_LWU : OPC_LW),
  TCG_TMP0, TCG_TMP3, cmp_off);
-tcg_out_movi(s, TCG_TYPE_TL, TCG_TMP1, mask);
-/* No second compare is required here;
-   load the tlb addend for the fast path.  */
-tcg_out_ld(s, TCG_TYPE_PTR, TCG_TMP2, TCG_TMP3, add_off);
 }
 
 /* Zero extend a 32-bit guest address for a 64-bit host. */
@@ -1185,7 +1173,25 @@ static void tcg_out_tlb_load(TCGContext *s, TCGReg base, 
TCGReg addrl,
 tcg_out_ext32u(s, base, addrl);
 addrl = base;
 }
-tcg_out_opc_reg(s, OPC_AND, TCG_TMP1, TCG_TMP1, addrl);
+
+/*
+ * Mask the page bits, keeping the alignment bits to compare against.
+ * For unaligned accesses, compare against the end of the access to
+ * verify that it does not cross a page boundary.
+ */
+tlb_mask = (target_ulong)TARGET_PAGE_MASK | a_mask;
+tcg_out_movi(s, TCG_TYPE_I32, TCG_TMP1, tlb_mask);
+if (a_mask >= s_mask) {
+tcg_out_opc_reg(s, OPC_AND, TCG_TMP1, TCG_TMP1, addrl);
+} else {
+tcg_out_opc_imm(s, ALIAS_PADDI, TCG_TMP2, addrl, s_mask - a_mask);
+tcg_out_opc_reg(s, OPC_AND, TCG_TMP1, TCG_TMP1, TCG_TMP2);
+}
+
+if (TCG_TARGET_REG_BITS >= TARGET_LONG_BITS) {
+/* Load the tlb addend for the fast path.  */
+tcg_out_ld(s, TCG_TYPE_PTR, TCG_TMP2, TCG_TMP3, add_off);
+}
 
 label_ptr[0] = s->code_ptr;
 tcg_out_opc_br(s, OPC_BNE, TCG_TMP1, TCG_TMP0);
@@ -1193,7 +1199,7 @@ static void tcg_out_tlb_load(TCGContext *s, TCGReg base, 
TCGReg addrl,
 /* Load and test the high half tlb comparator.  */
 if (TCG_TARGET_REG_BITS < TARGET_LONG_BITS) {
 /* delay slot */
-tcg_out_ld(s, TCG_TYPE_I32, TCG_TMP0, TCG_TMP3, cmp_off + HI_OFF);
+tcg_out_ldst(s, OPC_LW, TCG_TMP0, TCG_TMP3, cmp_off + HI_OFF);
 
 /* Load the tlb addend for the fast path.  */
 tcg_out_ld(s, TCG_TYPE_PTR, TCG_TMP2, TCG_TMP3, add_off);
@@ -1515,8 +1521,7 @@ static void tcg_out_qemu_ld_direct(TCGContext *s, TCGReg 
lo, TCGReg hi,
 }
 }
 
-static void __attribute__((unused))
-tcg_out_qemu_ld_unalign(TCGContext *s, TCGReg lo, TCGReg hi,
+static void tcg_out_qemu_ld_unalign(TCGContext *s, TCGReg lo, TCGReg hi,
 TCGReg base, MemOp opc, bool is_64)
 {
 const MIPSInsn lw1 = MIPS_BE ? OPC_LWL : OPC_LWR;
@@ -1645,8 +1650,8 @@ static void tcg_out_qemu_ld(TCGContext *s, const TCGArg 
*args, bool is_64)
 #if defined(CONFIG_SOFTMM

[PULL 13/34] tcg/riscv: Support raising sigbus for user-only

2022-02-10 Thread Richard Henderson
Signed-off-by: Richard Henderson 
---
 tcg/riscv/tcg-target.h |  2 --
 tcg/riscv/tcg-target.c.inc | 63 --
 2 files changed, 61 insertions(+), 4 deletions(-)

diff --git a/tcg/riscv/tcg-target.h b/tcg/riscv/tcg-target.h
index ef78b99e98..11c9b3e4f4 100644
--- a/tcg/riscv/tcg-target.h
+++ b/tcg/riscv/tcg-target.h
@@ -165,9 +165,7 @@ void tb_target_set_jmp_target(uintptr_t, uintptr_t, 
uintptr_t, uintptr_t);
 
 #define TCG_TARGET_DEFAULT_MO (0)
 
-#ifdef CONFIG_SOFTMMU
 #define TCG_TARGET_NEED_LDST_LABELS
-#endif
 #define TCG_TARGET_NEED_POOL_LABELS
 
 #define TCG_TARGET_HAS_MEMORY_BSWAP 0
diff --git a/tcg/riscv/tcg-target.c.inc b/tcg/riscv/tcg-target.c.inc
index e9488f7093..6409d9c3d5 100644
--- a/tcg/riscv/tcg-target.c.inc
+++ b/tcg/riscv/tcg-target.c.inc
@@ -27,6 +27,7 @@
  * THE SOFTWARE.
  */
 
+#include "../tcg-ldst.c.inc"
 #include "../tcg-pool.c.inc"
 
 #ifdef CONFIG_DEBUG_TCG
@@ -847,8 +848,6 @@ static void tcg_out_mb(TCGContext *s, TCGArg a0)
  */
 
 #if defined(CONFIG_SOFTMMU)
-#include "../tcg-ldst.c.inc"
-
 /* helper signature: helper_ret_ld_mmu(CPUState *env, target_ulong addr,
  * MemOpIdx oi, uintptr_t ra)
  */
@@ -1053,6 +1052,54 @@ static bool tcg_out_qemu_st_slow_path(TCGContext *s, 
TCGLabelQemuLdst *l)
 tcg_out_goto(s, l->raddr);
 return true;
 }
+#else
+
+static void tcg_out_test_alignment(TCGContext *s, bool is_ld, TCGReg addr_reg,
+   unsigned a_bits)
+{
+unsigned a_mask = (1 << a_bits) - 1;
+TCGLabelQemuLdst *l = new_ldst_label(s);
+
+l->is_ld = is_ld;
+l->addrlo_reg = addr_reg;
+
+/* We are expecting a_bits to max out at 7, so we can always use andi. */
+tcg_debug_assert(a_bits < 12);
+tcg_out_opc_imm(s, OPC_ANDI, TCG_REG_TMP1, addr_reg, a_mask);
+
+l->label_ptr[0] = s->code_ptr;
+tcg_out_opc_branch(s, OPC_BNE, TCG_REG_TMP1, TCG_REG_ZERO, 0);
+
+l->raddr = tcg_splitwx_to_rx(s->code_ptr);
+}
+
+static bool tcg_out_fail_alignment(TCGContext *s, TCGLabelQemuLdst *l)
+{
+/* resolve label address */
+if (!reloc_sbimm12(l->label_ptr[0], tcg_splitwx_to_rx(s->code_ptr))) {
+return false;
+}
+
+tcg_out_mov(s, TCG_TYPE_TL, TCG_REG_A1, l->addrlo_reg);
+tcg_out_mov(s, TCG_TYPE_PTR, TCG_REG_A0, TCG_AREG0);
+
+/* tail call, with the return address back inline. */
+tcg_out_movi(s, TCG_TYPE_PTR, TCG_REG_RA, (uintptr_t)l->raddr);
+tcg_out_call_int(s, (const void *)(l->is_ld ? helper_unaligned_ld
+   : helper_unaligned_st), true);
+return true;
+}
+
+static bool tcg_out_qemu_ld_slow_path(TCGContext *s, TCGLabelQemuLdst *l)
+{
+return tcg_out_fail_alignment(s, l);
+}
+
+static bool tcg_out_qemu_st_slow_path(TCGContext *s, TCGLabelQemuLdst *l)
+{
+return tcg_out_fail_alignment(s, l);
+}
+
 #endif /* CONFIG_SOFTMMU */
 
 static void tcg_out_qemu_ld_direct(TCGContext *s, TCGReg lo, TCGReg hi,
@@ -1108,6 +1155,8 @@ static void tcg_out_qemu_ld(TCGContext *s, const TCGArg 
*args, bool is_64)
 MemOp opc;
 #if defined(CONFIG_SOFTMMU)
 tcg_insn_unit *label_ptr[1];
+#else
+unsigned a_bits;
 #endif
 TCGReg base = TCG_REG_TMP0;
 
@@ -1130,6 +1179,10 @@ static void tcg_out_qemu_ld(TCGContext *s, const TCGArg 
*args, bool is_64)
 tcg_out_ext32u(s, base, addr_regl);
 addr_regl = base;
 }
+a_bits = get_alignment_bits(opc);
+if (a_bits) {
+tcg_out_test_alignment(s, true, addr_regl, a_bits);
+}
 if (guest_base != 0) {
 tcg_out_opc_reg(s, OPC_ADD, base, TCG_GUEST_BASE_REG, addr_regl);
 }
@@ -1174,6 +1227,8 @@ static void tcg_out_qemu_st(TCGContext *s, const TCGArg 
*args, bool is_64)
 MemOp opc;
 #if defined(CONFIG_SOFTMMU)
 tcg_insn_unit *label_ptr[1];
+#else
+unsigned a_bits;
 #endif
 TCGReg base = TCG_REG_TMP0;
 
@@ -1196,6 +1251,10 @@ static void tcg_out_qemu_st(TCGContext *s, const TCGArg 
*args, bool is_64)
 tcg_out_ext32u(s, base, addr_regl);
 addr_regl = base;
 }
+a_bits = get_alignment_bits(opc);
+if (a_bits) {
+tcg_out_test_alignment(s, false, addr_regl, a_bits);
+}
 if (guest_base != 0) {
 tcg_out_opc_reg(s, OPC_ADD, base, TCG_GUEST_BASE_REG, addr_regl);
 }
-- 
2.25.1




[PULL 14/34] tcg/s390x: Support raising sigbus for user-only

2022-02-10 Thread Richard Henderson
Reviewed-by: Peter Maydell 
Signed-off-by: Richard Henderson 
---
 tcg/s390x/tcg-target.h |  2 --
 tcg/s390x/tcg-target.c.inc | 59 --
 2 files changed, 57 insertions(+), 4 deletions(-)

diff --git a/tcg/s390x/tcg-target.h b/tcg/s390x/tcg-target.h
index 527ada0f63..69217d995b 100644
--- a/tcg/s390x/tcg-target.h
+++ b/tcg/s390x/tcg-target.h
@@ -178,9 +178,7 @@ static inline void tb_target_set_jmp_target(uintptr_t 
tc_ptr, uintptr_t jmp_rx,
 /* no need to flush icache explicitly */
 }
 
-#ifdef CONFIG_SOFTMMU
 #define TCG_TARGET_NEED_LDST_LABELS
-#endif
 #define TCG_TARGET_NEED_POOL_LABELS
 
 #endif
diff --git a/tcg/s390x/tcg-target.c.inc b/tcg/s390x/tcg-target.c.inc
index b12fbfda63..d56c1e51e4 100644
--- a/tcg/s390x/tcg-target.c.inc
+++ b/tcg/s390x/tcg-target.c.inc
@@ -29,6 +29,7 @@
 #error "unsupported code generation mode"
 #endif
 
+#include "../tcg-ldst.c.inc"
 #include "../tcg-pool.c.inc"
 #include "elf.h"
 
@@ -136,6 +137,7 @@ typedef enum S390Opcode {
 RI_OIHL = 0xa509,
 RI_OILH = 0xa50a,
 RI_OILL = 0xa50b,
+RI_TMLL = 0xa701,
 
 RIE_CGIJ= 0xec7c,
 RIE_CGRJ= 0xec64,
@@ -1804,8 +1806,6 @@ static void tcg_out_qemu_st_direct(TCGContext *s, MemOp 
opc, TCGReg data,
 }
 
 #if defined(CONFIG_SOFTMMU)
-#include "../tcg-ldst.c.inc"
-
 /* We're expecting to use a 20-bit negative offset on the tlb memory ops.  */
 QEMU_BUILD_BUG_ON(TLB_MASK_TABLE_OFS(0) > 0);
 QEMU_BUILD_BUG_ON(TLB_MASK_TABLE_OFS(0) < -(1 << 19));
@@ -1942,6 +1942,53 @@ static bool tcg_out_qemu_st_slow_path(TCGContext *s, 
TCGLabelQemuLdst *lb)
 return true;
 }
 #else
+static void tcg_out_test_alignment(TCGContext *s, bool is_ld,
+   TCGReg addrlo, unsigned a_bits)
+{
+unsigned a_mask = (1 << a_bits) - 1;
+TCGLabelQemuLdst *l = new_ldst_label(s);
+
+l->is_ld = is_ld;
+l->addrlo_reg = addrlo;
+
+/* We are expecting a_bits to max out at 7, much lower than TMLL. */
+tcg_debug_assert(a_bits < 16);
+tcg_out_insn(s, RI, TMLL, addrlo, a_mask);
+
+tcg_out16(s, RI_BRC | (7 << 4)); /* CC in {1,2,3} */
+l->label_ptr[0] = s->code_ptr;
+s->code_ptr += 1;
+
+l->raddr = tcg_splitwx_to_rx(s->code_ptr);
+}
+
+static bool tcg_out_fail_alignment(TCGContext *s, TCGLabelQemuLdst *l)
+{
+if (!patch_reloc(l->label_ptr[0], R_390_PC16DBL,
+ (intptr_t)tcg_splitwx_to_rx(s->code_ptr), 2)) {
+return false;
+}
+
+tcg_out_mov(s, TCG_TYPE_TL, TCG_REG_R3, l->addrlo_reg);
+tcg_out_mov(s, TCG_TYPE_PTR, TCG_REG_R2, TCG_AREG0);
+
+/* "Tail call" to the helper, with the return address back inline. */
+tcg_out_movi(s, TCG_TYPE_PTR, TCG_REG_R14, (uintptr_t)l->raddr);
+tgen_gotoi(s, S390_CC_ALWAYS, (const void *)(l->is_ld ? helper_unaligned_ld
+ : helper_unaligned_st));
+return true;
+}
+
+static bool tcg_out_qemu_ld_slow_path(TCGContext *s, TCGLabelQemuLdst *l)
+{
+return tcg_out_fail_alignment(s, l);
+}
+
+static bool tcg_out_qemu_st_slow_path(TCGContext *s, TCGLabelQemuLdst *l)
+{
+return tcg_out_fail_alignment(s, l);
+}
+
 static void tcg_prepare_user_ldst(TCGContext *s, TCGReg *addr_reg,
   TCGReg *index_reg, tcg_target_long *disp)
 {
@@ -1980,7 +2027,11 @@ static void tcg_out_qemu_ld(TCGContext* s, TCGReg 
data_reg, TCGReg addr_reg,
 #else
 TCGReg index_reg;
 tcg_target_long disp;
+unsigned a_bits = get_alignment_bits(opc);
 
+if (a_bits) {
+tcg_out_test_alignment(s, true, addr_reg, a_bits);
+}
 tcg_prepare_user_ldst(s, &addr_reg, &index_reg, &disp);
 tcg_out_qemu_ld_direct(s, opc, data_reg, addr_reg, index_reg, disp);
 #endif
@@ -2007,7 +2058,11 @@ static void tcg_out_qemu_st(TCGContext* s, TCGReg 
data_reg, TCGReg addr_reg,
 #else
 TCGReg index_reg;
 tcg_target_long disp;
+unsigned a_bits = get_alignment_bits(opc);
 
+if (a_bits) {
+tcg_out_test_alignment(s, false, addr_reg, a_bits);
+}
 tcg_prepare_user_ldst(s, &addr_reg, &index_reg, &disp);
 tcg_out_qemu_st_direct(s, opc, data_reg, addr_reg, index_reg, disp);
 #endif
-- 
2.25.1




[PULL 23/34] tcg/arm: Support raising sigbus for user-only

2022-02-10 Thread Richard Henderson
Reviewed-by: Peter Maydell 
Signed-off-by: Richard Henderson 
---
 tcg/arm/tcg-target.h |  2 -
 tcg/arm/tcg-target.c.inc | 83 +++-
 2 files changed, 81 insertions(+), 4 deletions(-)

diff --git a/tcg/arm/tcg-target.h b/tcg/arm/tcg-target.h
index 1dd4cd5377..27c27a1f14 100644
--- a/tcg/arm/tcg-target.h
+++ b/tcg/arm/tcg-target.h
@@ -151,9 +151,7 @@ extern bool use_neon_instructions;
 /* not defined -- call should be eliminated at compile time */
 void tb_target_set_jmp_target(uintptr_t, uintptr_t, uintptr_t, uintptr_t);
 
-#ifdef CONFIG_SOFTMMU
 #define TCG_TARGET_NEED_LDST_LABELS
-#endif
 #define TCG_TARGET_NEED_POOL_LABELS
 
 #endif
diff --git a/tcg/arm/tcg-target.c.inc b/tcg/arm/tcg-target.c.inc
index 7eebbfaf02..e1ea69669c 100644
--- a/tcg/arm/tcg-target.c.inc
+++ b/tcg/arm/tcg-target.c.inc
@@ -23,6 +23,7 @@
  */
 
 #include "elf.h"
+#include "../tcg-ldst.c.inc"
 #include "../tcg-pool.c.inc"
 
 int arm_arch = __ARM_ARCH;
@@ -1289,8 +1290,6 @@ static void tcg_out_vldst(TCGContext *s, ARMInsn insn,
 }
 
 #ifdef CONFIG_SOFTMMU
-#include "../tcg-ldst.c.inc"
-
 /* helper signature: helper_ret_ld_mmu(CPUState *env, target_ulong addr,
  * int mmu_idx, uintptr_t ra)
  */
@@ -1592,6 +1591,74 @@ static bool tcg_out_qemu_st_slow_path(TCGContext *s, 
TCGLabelQemuLdst *lb)
 tcg_out_goto(s, COND_AL, qemu_st_helpers[opc & MO_SIZE]);
 return true;
 }
+#else
+
+static void tcg_out_test_alignment(TCGContext *s, bool is_ld, TCGReg addrlo,
+   TCGReg addrhi, unsigned a_bits)
+{
+unsigned a_mask = (1 << a_bits) - 1;
+TCGLabelQemuLdst *label = new_ldst_label(s);
+
+label->is_ld = is_ld;
+label->addrlo_reg = addrlo;
+label->addrhi_reg = addrhi;
+
+/* We are expecting a_bits to max out at 7, and can easily support 8. */
+tcg_debug_assert(a_mask <= 0xff);
+/* tst addr, #mask */
+tcg_out_dat_imm(s, COND_AL, ARITH_TST, 0, addrlo, a_mask);
+
+/* blne slow_path */
+label->label_ptr[0] = s->code_ptr;
+tcg_out_bl_imm(s, COND_NE, 0);
+
+label->raddr = tcg_splitwx_to_rx(s->code_ptr);
+}
+
+static bool tcg_out_fail_alignment(TCGContext *s, TCGLabelQemuLdst *l)
+{
+if (!reloc_pc24(l->label_ptr[0], tcg_splitwx_to_rx(s->code_ptr))) {
+return false;
+}
+
+if (TARGET_LONG_BITS == 64) {
+/* 64-bit target address is aligned into R2:R3. */
+if (l->addrhi_reg != TCG_REG_R2) {
+tcg_out_mov(s, TCG_TYPE_I32, TCG_REG_R2, l->addrlo_reg);
+tcg_out_mov(s, TCG_TYPE_I32, TCG_REG_R3, l->addrhi_reg);
+} else if (l->addrlo_reg != TCG_REG_R3) {
+tcg_out_mov(s, TCG_TYPE_I32, TCG_REG_R3, l->addrhi_reg);
+tcg_out_mov(s, TCG_TYPE_I32, TCG_REG_R2, l->addrlo_reg);
+} else {
+tcg_out_mov(s, TCG_TYPE_I32, TCG_REG_R1, TCG_REG_R2);
+tcg_out_mov(s, TCG_TYPE_I32, TCG_REG_R2, TCG_REG_R3);
+tcg_out_mov(s, TCG_TYPE_I32, TCG_REG_R3, TCG_REG_R1);
+}
+} else {
+tcg_out_mov(s, TCG_TYPE_I32, TCG_REG_R1, l->addrlo_reg);
+}
+tcg_out_mov(s, TCG_TYPE_PTR, TCG_REG_R0, TCG_AREG0);
+
+/*
+ * Tail call to the helper, with the return address back inline,
+ * just for the clarity of the debugging traceback -- the helper
+ * cannot return.  We have used BLNE to arrive here, so LR is
+ * already set.
+ */
+tcg_out_goto(s, COND_AL, (const void *)
+ (l->is_ld ? helper_unaligned_ld : helper_unaligned_st));
+return true;
+}
+
+static bool tcg_out_qemu_ld_slow_path(TCGContext *s, TCGLabelQemuLdst *l)
+{
+return tcg_out_fail_alignment(s, l);
+}
+
+static bool tcg_out_qemu_st_slow_path(TCGContext *s, TCGLabelQemuLdst *l)
+{
+return tcg_out_fail_alignment(s, l);
+}
 #endif /* SOFTMMU */
 
 static void tcg_out_qemu_ld_index(TCGContext *s, MemOp opc,
@@ -1689,6 +1756,8 @@ static void tcg_out_qemu_ld(TCGContext *s, const TCGArg 
*args, bool is64)
 int mem_index;
 TCGReg addend;
 tcg_insn_unit *label_ptr;
+#else
+unsigned a_bits;
 #endif
 
 datalo = *args++;
@@ -1712,6 +1781,10 @@ static void tcg_out_qemu_ld(TCGContext *s, const TCGArg 
*args, bool is64)
 add_qemu_ldst_label(s, true, oi, datalo, datahi, addrlo, addrhi,
 s->code_ptr, label_ptr);
 #else /* !CONFIG_SOFTMMU */
+a_bits = get_alignment_bits(opc);
+if (a_bits) {
+tcg_out_test_alignment(s, true, addrlo, addrhi, a_bits);
+}
 if (guest_base) {
 tcg_out_qemu_ld_index(s, opc, datalo, datahi,
   addrlo, TCG_REG_GUEST_BASE, false);
@@ -1801,6 +1874,8 @@ static void tcg_out_qemu_st(TCGContext *s, const TCGArg 
*args, bool is64)
 int mem_index;
 TCGReg addend;
 tcg_insn_unit *label_ptr;
+#else
+unsigned a_bits;
 #endif
 
 datalo = *args++;
@@ -1824,6 +1899,10 @@ static void tcg_out_qemu_st(TCGContext *s, const TCGArg 
*args, bool is64)
 add

[PULL 31/34] tcg/sparc: Use the constant pool for 64-bit constants

2022-02-10 Thread Richard Henderson
Reviewed-by: Peter Maydell 
Signed-off-by: Richard Henderson 
---
 tcg/sparc/tcg-target.c.inc | 15 +++
 1 file changed, 15 insertions(+)

diff --git a/tcg/sparc/tcg-target.c.inc b/tcg/sparc/tcg-target.c.inc
index 213aba4be6..e78945d153 100644
--- a/tcg/sparc/tcg-target.c.inc
+++ b/tcg/sparc/tcg-target.c.inc
@@ -336,6 +336,13 @@ static bool patch_reloc(tcg_insn_unit *src_rw, int type,
 insn &= ~INSN_OFF19(-1);
 insn |= INSN_OFF19(pcrel);
 break;
+case R_SPARC_13:
+if (!check_fit_ptr(value, 13)) {
+return false;
+}
+insn &= ~INSN_IMM13(-1);
+insn |= INSN_IMM13(value);
+break;
 default:
 g_assert_not_reached();
 }
@@ -479,6 +486,14 @@ static void tcg_out_movi_int(TCGContext *s, TCGType type, 
TCGReg ret,
 return;
 }
 
+/* Use the constant pool, if possible. */
+if (!in_prologue && USE_REG_TB) {
+new_pool_label(s, arg, R_SPARC_13, s->code_ptr,
+   tcg_tbrel_diff(s, NULL));
+tcg_out32(s, LDX | INSN_RD(ret) | INSN_RS1(TCG_REG_TB));
+return;
+}
+
 /* A 64-bit constant decomposed into 2 32-bit pieces.  */
 if (check_fit_i32(lo, 13)) {
 hi = (arg - lo) >> 32;
-- 
2.25.1




[PULL 11/34] tcg/aarch64: Support raising sigbus for user-only

2022-02-10 Thread Richard Henderson
Reviewed-by: Peter Maydell 
Signed-off-by: Richard Henderson 
---
 tcg/aarch64/tcg-target.h |  2 -
 tcg/aarch64/tcg-target.c.inc | 91 +---
 2 files changed, 74 insertions(+), 19 deletions(-)

diff --git a/tcg/aarch64/tcg-target.h b/tcg/aarch64/tcg-target.h
index 7a93ac8023..876af589ce 100644
--- a/tcg/aarch64/tcg-target.h
+++ b/tcg/aarch64/tcg-target.h
@@ -151,9 +151,7 @@ typedef enum {
 
 void tb_target_set_jmp_target(uintptr_t, uintptr_t, uintptr_t, uintptr_t);
 
-#ifdef CONFIG_SOFTMMU
 #define TCG_TARGET_NEED_LDST_LABELS
-#endif
 #define TCG_TARGET_NEED_POOL_LABELS
 
 #endif /* AARCH64_TCG_TARGET_H */
diff --git a/tcg/aarch64/tcg-target.c.inc b/tcg/aarch64/tcg-target.c.inc
index a8db553287..077fc51401 100644
--- a/tcg/aarch64/tcg-target.c.inc
+++ b/tcg/aarch64/tcg-target.c.inc
@@ -10,6 +10,7 @@
  * See the COPYING file in the top-level directory for details.
  */
 
+#include "../tcg-ldst.c.inc"
 #include "../tcg-pool.c.inc"
 #include "qemu/bitops.h"
 
@@ -443,6 +444,7 @@ typedef enum {
 I3404_ANDI  = 0x1200,
 I3404_ORRI  = 0x3200,
 I3404_EORI  = 0x5200,
+I3404_ANDSI = 0x7200,
 
 /* Move wide immediate instructions.  */
 I3405_MOVN  = 0x1280,
@@ -1328,8 +1330,9 @@ static void tcg_out_goto_long(TCGContext *s, const 
tcg_insn_unit *target)
 if (offset == sextract64(offset, 0, 26)) {
 tcg_out_insn(s, 3206, B, offset);
 } else {
-tcg_out_movi(s, TCG_TYPE_I64, TCG_REG_TMP, (intptr_t)target);
-tcg_out_insn(s, 3207, BR, TCG_REG_TMP);
+/* Choose X9 as a call-clobbered non-LR temporary. */
+tcg_out_movi(s, TCG_TYPE_I64, TCG_REG_X9, (intptr_t)target);
+tcg_out_insn(s, 3207, BR, TCG_REG_X9);
 }
 }
 
@@ -1541,9 +1544,14 @@ static void tcg_out_cltz(TCGContext *s, TCGType ext, 
TCGReg d,
 }
 }
 
-#ifdef CONFIG_SOFTMMU
-#include "../tcg-ldst.c.inc"
+static void tcg_out_adr(TCGContext *s, TCGReg rd, const void *target)
+{
+ptrdiff_t offset = tcg_pcrel_diff(s, target);
+tcg_debug_assert(offset == sextract64(offset, 0, 21));
+tcg_out_insn(s, 3406, ADR, rd, offset);
+}
 
+#ifdef CONFIG_SOFTMMU
 /* helper signature: helper_ret_ld_mmu(CPUState *env, target_ulong addr,
  * MemOpIdx oi, uintptr_t ra)
  */
@@ -1577,13 +1585,6 @@ static void * const qemu_st_helpers[MO_SIZE + 1] = {
 #endif
 };
 
-static inline void tcg_out_adr(TCGContext *s, TCGReg rd, const void *target)
-{
-ptrdiff_t offset = tcg_pcrel_diff(s, target);
-tcg_debug_assert(offset == sextract64(offset, 0, 21));
-tcg_out_insn(s, 3406, ADR, rd, offset);
-}
-
 static bool tcg_out_qemu_ld_slow_path(TCGContext *s, TCGLabelQemuLdst *lb)
 {
 MemOpIdx oi = lb->oi;
@@ -1714,15 +1715,58 @@ static void tcg_out_tlb_read(TCGContext *s, TCGReg 
addr_reg, MemOp opc,
 tcg_out_insn(s, 3202, B_C, TCG_COND_NE, 0);
 }
 
+#else
+static void tcg_out_test_alignment(TCGContext *s, bool is_ld, TCGReg addr_reg,
+   unsigned a_bits)
+{
+unsigned a_mask = (1 << a_bits) - 1;
+TCGLabelQemuLdst *label = new_ldst_label(s);
+
+label->is_ld = is_ld;
+label->addrlo_reg = addr_reg;
+
+/* tst addr, #mask */
+tcg_out_logicali(s, I3404_ANDSI, 0, TCG_REG_XZR, addr_reg, a_mask);
+
+label->label_ptr[0] = s->code_ptr;
+
+/* b.ne slow_path */
+tcg_out_insn(s, 3202, B_C, TCG_COND_NE, 0);
+
+label->raddr = tcg_splitwx_to_rx(s->code_ptr);
+}
+
+static bool tcg_out_fail_alignment(TCGContext *s, TCGLabelQemuLdst *l)
+{
+if (!reloc_pc19(l->label_ptr[0], tcg_splitwx_to_rx(s->code_ptr))) {
+return false;
+}
+
+tcg_out_mov(s, TCG_TYPE_TL, TCG_REG_X1, l->addrlo_reg);
+tcg_out_mov(s, TCG_TYPE_PTR, TCG_REG_X0, TCG_AREG0);
+
+/* "Tail call" to the helper, with the return address back inline. */
+tcg_out_adr(s, TCG_REG_LR, l->raddr);
+tcg_out_goto_long(s, (const void *)(l->is_ld ? helper_unaligned_ld
+: helper_unaligned_st));
+return true;
+}
+
+static bool tcg_out_qemu_ld_slow_path(TCGContext *s, TCGLabelQemuLdst *l)
+{
+return tcg_out_fail_alignment(s, l);
+}
+
+static bool tcg_out_qemu_st_slow_path(TCGContext *s, TCGLabelQemuLdst *l)
+{
+return tcg_out_fail_alignment(s, l);
+}
 #endif /* CONFIG_SOFTMMU */
 
 static void tcg_out_qemu_ld_direct(TCGContext *s, MemOp memop, TCGType ext,
TCGReg data_r, TCGReg addr_r,
TCGType otype, TCGReg off_r)
 {
-/* Byte swapping is left to middle-end expansion. */
-tcg_debug_assert((memop & MO_BSWAP) == 0);
-
 switch (memop & MO_SSIZE) {
 case MO_UB:
 tcg_out_ldst_r(s, I3312_LDRB, data_r, addr_r, otype, off_r);
@@ -1756,9 +1800,6 @@ static void tcg_out_qemu_st_direct(TCGContext *s, MemOp 
memop,
TCGReg data_r, TCGReg addr_r,
TCGType oty

[PULL 22/34] tcg/arm: Reserve a register for guest_base

2022-02-10 Thread Richard Henderson
Reserve a register for the guest_base using aarch64 for reference.
By doing so, we do not have to recompute it for every memory load.
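
The enabling half of the change is a one-time setup in the prologue
(that hunk is cut off below); roughly, mirroring what aarch64 already
does:

    #ifndef CONFIG_SOFTMMU
        if (guest_base) {
            /* Load guest_base once and mark the register reserved so
             * the allocator never hands it out. */
            tcg_out_movi(s, TCG_TYPE_PTR, TCG_REG_GUEST_BASE, guest_base);
            tcg_regset_set_reg(s->reserved_regs, TCG_REG_GUEST_BASE);
        }
    #endif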

Reviewed-by: Peter Maydell 
Signed-off-by: Richard Henderson 
---
 tcg/arm/tcg-target.c.inc | 39 ---
 1 file changed, 28 insertions(+), 11 deletions(-)

diff --git a/tcg/arm/tcg-target.c.inc b/tcg/arm/tcg-target.c.inc
index d290b4556c..7eebbfaf02 100644
--- a/tcg/arm/tcg-target.c.inc
+++ b/tcg/arm/tcg-target.c.inc
@@ -84,6 +84,9 @@ static const int tcg_target_call_oarg_regs[2] = {
 
 #define TCG_REG_TMP  TCG_REG_R12
 #define TCG_VEC_TMP  TCG_REG_Q15
+#ifndef CONFIG_SOFTMMU
+#define TCG_REG_GUEST_BASE  TCG_REG_R11
+#endif
 
 typedef enum {
 COND_EQ = 0x0,
@@ -1593,7 +1596,8 @@ static bool tcg_out_qemu_st_slow_path(TCGContext *s, 
TCGLabelQemuLdst *lb)
 
 static void tcg_out_qemu_ld_index(TCGContext *s, MemOp opc,
   TCGReg datalo, TCGReg datahi,
-  TCGReg addrlo, TCGReg addend)
+  TCGReg addrlo, TCGReg addend,
+  bool scratch_addend)
 {
 /* Byte swapping is left to middle-end expansion. */
 tcg_debug_assert((opc & MO_BSWAP) == 0);
@@ -1619,7 +1623,7 @@ static void tcg_out_qemu_ld_index(TCGContext *s, MemOp 
opc,
 if (get_alignment_bits(opc) >= MO_64
 && (datalo & 1) == 0 && datahi == datalo + 1) {
 tcg_out_ldrd_r(s, COND_AL, datalo, addrlo, addend);
-} else if (datalo != addend) {
+} else if (scratch_addend) {
 tcg_out_ld32_rwb(s, COND_AL, datalo, addend, addrlo);
 tcg_out_ld32_12(s, COND_AL, datahi, addend, 4);
 } else {
@@ -1703,14 +1707,14 @@ static void tcg_out_qemu_ld(TCGContext *s, const TCGArg 
*args, bool is64)
 label_ptr = s->code_ptr;
 tcg_out_bl_imm(s, COND_NE, 0);
 
-tcg_out_qemu_ld_index(s, opc, datalo, datahi, addrlo, addend);
+tcg_out_qemu_ld_index(s, opc, datalo, datahi, addrlo, addend, true);
 
 add_qemu_ldst_label(s, true, oi, datalo, datahi, addrlo, addrhi,
 s->code_ptr, label_ptr);
 #else /* !CONFIG_SOFTMMU */
 if (guest_base) {
-tcg_out_movi(s, TCG_TYPE_PTR, TCG_REG_TMP, guest_base);
-tcg_out_qemu_ld_index(s, opc, datalo, datahi, addrlo, TCG_REG_TMP);
+tcg_out_qemu_ld_index(s, opc, datalo, datahi,
+  addrlo, TCG_REG_GUEST_BASE, false);
 } else {
 tcg_out_qemu_ld_direct(s, opc, datalo, datahi, addrlo);
 }
@@ -1719,7 +1723,8 @@ static void tcg_out_qemu_ld(TCGContext *s, const TCGArg 
*args, bool is64)
 
 static void tcg_out_qemu_st_index(TCGContext *s, ARMCond cond, MemOp opc,
   TCGReg datalo, TCGReg datahi,
-  TCGReg addrlo, TCGReg addend)
+  TCGReg addrlo, TCGReg addend,
+  bool scratch_addend)
 {
 /* Byte swapping is left to middle-end expansion. */
 tcg_debug_assert((opc & MO_BSWAP) == 0);
@@ -1739,9 +1744,14 @@ static void tcg_out_qemu_st_index(TCGContext *s, ARMCond 
cond, MemOp opc,
 if (get_alignment_bits(opc) >= MO_64
 && (datalo & 1) == 0 && datahi == datalo + 1) {
 tcg_out_strd_r(s, cond, datalo, addrlo, addend);
-} else {
+} else if (scratch_addend) {
 tcg_out_st32_rwb(s, cond, datalo, addend, addrlo);
 tcg_out_st32_12(s, cond, datahi, addend, 4);
+} else {
+tcg_out_dat_reg(s, cond, ARITH_ADD, TCG_REG_TMP,
+addend, addrlo, SHIFT_IMM_LSL(0));
+tcg_out_st32_12(s, cond, datalo, TCG_REG_TMP, 0);
+tcg_out_st32_12(s, cond, datahi, TCG_REG_TMP, 4);
 }
 break;
 default:
@@ -1804,7 +1814,8 @@ static void tcg_out_qemu_st(TCGContext *s, const TCGArg 
*args, bool is64)
 mem_index = get_mmuidx(oi);
 addend = tcg_out_tlb_read(s, addrlo, addrhi, opc, mem_index, 0);
 
-tcg_out_qemu_st_index(s, COND_EQ, opc, datalo, datahi, addrlo, addend);
+tcg_out_qemu_st_index(s, COND_EQ, opc, datalo, datahi,
+  addrlo, addend, true);
 
 /* The conditional call must come last, as we're going to return here.  */
 label_ptr = s->code_ptr;
@@ -1814,9 +1825,8 @@ static void tcg_out_qemu_st(TCGContext *s, const TCGArg 
*args, bool is64)
 s->code_ptr, label_ptr);
 #else /* !CONFIG_SOFTMMU */
 if (guest_base) {
-tcg_out_movi(s, TCG_TYPE_PTR, TCG_REG_TMP, guest_base);
-tcg_out_qemu_st_index(s, COND_AL, opc, datalo,
-  datahi, addrlo, TCG_REG_TMP);
+tcg_out_qemu_st_index(s, COND_AL, opc, datalo, datahi,
+  addrlo, TCG_REG_GUEST_BASE, false);
 } else {
 tcg_out_qemu_st_direct(s, opc, datalo, datahi, addrlo);
 }
@@ -2958,6 +2968,13 @@ static void tcg_target

[PULL 33/34] tcg/sparc: Support unaligned access for user-only

2022-02-10 Thread Richard Henderson
This is kinda sorta the opposite of the other tcg hosts, where
we get (normal) alignment checks for free with host SIGBUS and
need to add code to support unaligned accesses.

This inline code expansion is somewhat large, but it takes quite
a few instructions to make a function call to a helper anyway.

Reviewed-by: Peter Maydell 
Signed-off-by: Richard Henderson 
---
 tcg/sparc/tcg-target.c.inc | 219 +++--
 1 file changed, 211 insertions(+), 8 deletions(-)

diff --git a/tcg/sparc/tcg-target.c.inc b/tcg/sparc/tcg-target.c.inc
index 646bb462c3..72d9552fd0 100644
--- a/tcg/sparc/tcg-target.c.inc
+++ b/tcg/sparc/tcg-target.c.inc
@@ -211,6 +211,7 @@ static const int tcg_target_call_oarg_regs[] = {
 #define ARITH_ADD  (INSN_OP(2) | INSN_OP3(0x00))
 #define ARITH_ADDCC (INSN_OP(2) | INSN_OP3(0x10))
 #define ARITH_AND  (INSN_OP(2) | INSN_OP3(0x01))
+#define ARITH_ANDCC (INSN_OP(2) | INSN_OP3(0x11))
 #define ARITH_ANDN (INSN_OP(2) | INSN_OP3(0x05))
 #define ARITH_OR   (INSN_OP(2) | INSN_OP3(0x02))
 #define ARITH_ORCC (INSN_OP(2) | INSN_OP3(0x12))
@@ -1025,6 +1026,38 @@ static void build_trampolines(TCGContext *s)
 tcg_out_mov_delay(s, TCG_REG_O0, TCG_AREG0);
 }
 }
+#else
+static const tcg_insn_unit *qemu_unalign_ld_trampoline;
+static const tcg_insn_unit *qemu_unalign_st_trampoline;
+
+static void build_trampolines(TCGContext *s)
+{
+for (int ld = 0; ld < 2; ++ld) {
+void *helper;
+
+while ((uintptr_t)s->code_ptr & 15) {
+tcg_out_nop(s);
+}
+
+if (ld) {
+helper = helper_unaligned_ld;
+qemu_unalign_ld_trampoline = tcg_splitwx_to_rx(s->code_ptr);
+} else {
+helper = helper_unaligned_st;
+qemu_unalign_st_trampoline = tcg_splitwx_to_rx(s->code_ptr);
+}
+
+if (!SPARC64 && TARGET_LONG_BITS == 64) {
+/* Install the high part of the address.  */
+tcg_out_arithi(s, TCG_REG_O1, TCG_REG_O2, 32, SHIFT_SRLX);
+}
+
+/* Tail call.  */
+tcg_out_jmpl_const(s, helper, true, true);
+/* delay slot -- set the env argument */
+tcg_out_mov_delay(s, TCG_REG_O0, TCG_AREG0);
+}
+}
 #endif
 
 /* Generate global QEMU prologue and epilogue code */
@@ -1075,9 +1108,7 @@ static void tcg_target_qemu_prologue(TCGContext *s)
 /* delay slot */
 tcg_out_movi_imm13(s, TCG_REG_O0, 0);
 
-#ifdef CONFIG_SOFTMMU
 build_trampolines(s);
-#endif
 }
 
 static void tcg_out_nop_fill(tcg_insn_unit *p, int count)
@@ -1162,18 +1193,22 @@ static TCGReg tcg_out_tlb_load(TCGContext *s, TCGReg 
addr, int mem_index,
 static const int qemu_ld_opc[(MO_SSIZE | MO_BSWAP) + 1] = {
 [MO_UB]   = LDUB,
 [MO_SB]   = LDSB,
+[MO_UB | MO_LE] = LDUB,
+[MO_SB | MO_LE] = LDSB,
 
 [MO_BEUW] = LDUH,
 [MO_BESW] = LDSH,
 [MO_BEUL] = LDUW,
 [MO_BESL] = LDSW,
 [MO_BEUQ] = LDX,
+[MO_BESQ] = LDX,
 
 [MO_LEUW] = LDUH_LE,
 [MO_LESW] = LDSH_LE,
 [MO_LEUL] = LDUW_LE,
 [MO_LESL] = LDSW_LE,
 [MO_LEUQ] = LDX_LE,
+[MO_LESQ] = LDX_LE,
 };
 
 static const int qemu_st_opc[(MO_SIZE | MO_BSWAP) + 1] = {
@@ -1192,11 +1227,12 @@ static void tcg_out_qemu_ld(TCGContext *s, TCGReg data, 
TCGReg addr,
 MemOpIdx oi, bool is_64)
 {
 MemOp memop = get_memop(oi);
+tcg_insn_unit *label_ptr;
+
 #ifdef CONFIG_SOFTMMU
 unsigned memi = get_mmuidx(oi);
 TCGReg addrz, param;
 const tcg_insn_unit *func;
-tcg_insn_unit *label_ptr;
 
 addrz = tcg_out_tlb_load(s, addr, memi, memop,
  offsetof(CPUTLBEntry, addr_read));
@@ -1260,13 +1296,99 @@ static void tcg_out_qemu_ld(TCGContext *s, TCGReg data, 
TCGReg addr,
 
 *label_ptr |= INSN_OFF19(tcg_ptr_byte_diff(s->code_ptr, label_ptr));
 #else
+TCGReg index = (guest_base ? TCG_GUEST_BASE_REG : TCG_REG_G0);
+unsigned a_bits = get_alignment_bits(memop);
+unsigned s_bits = memop & MO_SIZE;
+unsigned t_bits;
+
 if (SPARC64 && TARGET_LONG_BITS == 32) {
 tcg_out_arithi(s, TCG_REG_T1, addr, 0, SHIFT_SRL);
 addr = TCG_REG_T1;
 }
-tcg_out_ldst_rr(s, data, addr,
-(guest_base ? TCG_GUEST_BASE_REG : TCG_REG_G0),
+
+/*
+ * Normal case: alignment equal to access size.
+ */
+if (a_bits == s_bits) {
+tcg_out_ldst_rr(s, data, addr, index,
+qemu_ld_opc[memop & (MO_BSWAP | MO_SSIZE)]);
+return;
+}
+
+/*
+ * Test for at least natural alignment, and assume most accesses
+ * will be aligned -- perform a straight load in the delay slot.
+ * This is required to preserve atomicity for aligned accesses.
+ */
+t_bits = MAX(a_bits, s_bits);
+tcg_debug_assert(t_bits < 13);
+tcg_out_arithi(s, TCG_REG_G0, addr, (1u << t_bits) - 1, ARITH_ANDCC);
+
+/* beq,a,pt %icc, label */
+label_ptr = s->code_ptr;
+tcg_out_bpcc0(s, COND_E, BPCC_A | BPCC_PT | BPCC_ICC, 0);

[PULL 18/34] tcg/arm: Remove use_armv5t_instructions

2022-02-10 Thread Richard Henderson
This is now always true, since we require armv6.

Reviewed-by: Peter Maydell 
Signed-off-by: Richard Henderson 
---
 tcg/arm/tcg-target.h |  3 +--
 tcg/arm/tcg-target.c.inc | 35 ++-
 2 files changed, 7 insertions(+), 31 deletions(-)

diff --git a/tcg/arm/tcg-target.h b/tcg/arm/tcg-target.h
index f41b809554..5c9ba5feea 100644
--- a/tcg/arm/tcg-target.h
+++ b/tcg/arm/tcg-target.h
@@ -28,7 +28,6 @@
 
 extern int arm_arch;
 
-#define use_armv5t_instructions (__ARM_ARCH >= 5 || arm_arch >= 5)
 #define use_armv6_instructions  (__ARM_ARCH >= 6 || arm_arch >= 6)
 #define use_armv7_instructions  (__ARM_ARCH >= 7 || arm_arch >= 7)
 
@@ -109,7 +108,7 @@ extern bool use_neon_instructions;
 #define TCG_TARGET_HAS_eqv_i32  0
 #define TCG_TARGET_HAS_nand_i32 0
 #define TCG_TARGET_HAS_nor_i32  0
-#define TCG_TARGET_HAS_clz_i32  use_armv5t_instructions
+#define TCG_TARGET_HAS_clz_i32  1
 #define TCG_TARGET_HAS_ctz_i32  use_armv7_instructions
 #define TCG_TARGET_HAS_ctpop_i320
 #define TCG_TARGET_HAS_deposit_i32  use_armv7_instructions
diff --git a/tcg/arm/tcg-target.c.inc b/tcg/arm/tcg-target.c.inc
index 29d63e98a8..f3b635063f 100644
--- a/tcg/arm/tcg-target.c.inc
+++ b/tcg/arm/tcg-target.c.inc
@@ -596,11 +596,7 @@ static void tcg_out_b_reg(TCGContext *s, ARMCond cond, 
TCGReg rn)
  * Unless the C portion of QEMU is compiled as thumb, we don't need
  * true BX semantics; merely a branch to an address held in a register.
  */
-if (use_armv5t_instructions) {
-tcg_out_bx_reg(s, cond, rn);
-} else {
-tcg_out_mov_reg(s, cond, TCG_REG_PC, rn);
-}
+tcg_out_bx_reg(s, cond, rn);
 }
 
 static void tcg_out_dat_imm(TCGContext *s, ARMCond cond, ARMInsn opc,
@@ -1247,14 +1243,7 @@ static void tcg_out_goto(TCGContext *s, ARMCond cond, 
const tcg_insn_unit *addr)
 }
 
 /* LDR is interworking from v5t. */
-if (arm_mode || use_armv5t_instructions) {
-tcg_out_movi_pool(s, cond, TCG_REG_PC, addri);
-return;
-}
-
-/* else v4t */
-tcg_out_movi32(s, COND_AL, TCG_REG_TMP, addri);
-tcg_out_bx_reg(s, COND_AL, TCG_REG_TMP);
+tcg_out_movi_pool(s, cond, TCG_REG_PC, addri);
 }
 
 /*
@@ -1270,26 +1259,14 @@ static void tcg_out_call(TCGContext *s, const 
tcg_insn_unit *addr)
 if (disp - 8 < 0x0200 && disp - 8 >= -0x0200) {
 if (arm_mode) {
 tcg_out_bl_imm(s, COND_AL, disp);
-return;
-}
-if (use_armv5t_instructions) {
+} else {
 tcg_out_blx_imm(s, disp);
-return;
 }
+return;
 }
 
-if (use_armv5t_instructions) {
-tcg_out_movi32(s, COND_AL, TCG_REG_TMP, addri);
-tcg_out_blx_reg(s, COND_AL, TCG_REG_TMP);
-} else if (arm_mode) {
-/* ??? Know that movi_pool emits exactly 1 insn.  */
-tcg_out_mov_reg(s, COND_AL, TCG_REG_R14, TCG_REG_PC);
-tcg_out_movi_pool(s, COND_AL, TCG_REG_PC, addri);
-} else {
-tcg_out_movi32(s, COND_AL, TCG_REG_TMP, addri);
-tcg_out_mov_reg(s, COND_AL, TCG_REG_R14, TCG_REG_PC);
-tcg_out_bx_reg(s, COND_AL, TCG_REG_TMP);
-}
+tcg_out_movi32(s, COND_AL, TCG_REG_TMP, addri);
+tcg_out_blx_reg(s, COND_AL, TCG_REG_TMP);
 }
 
 static void tcg_out_goto_label(TCGContext *s, ARMCond cond, TCGLabel *l)
-- 
2.25.1




[PULL 20/34] tcg/arm: Check alignment for ldrd and strd

2022-02-10 Thread Richard Henderson
We will shortly allow the use of unaligned memory accesses,
and these require proper alignment.  Use get_alignment_bits
to verify and remove USING_SOFTMMU.

Reviewed-by: Peter Maydell 
Signed-off-by: Richard Henderson 
---
 tcg/arm/tcg-target.c.inc | 23 ---
 1 file changed, 8 insertions(+), 15 deletions(-)

diff --git a/tcg/arm/tcg-target.c.inc b/tcg/arm/tcg-target.c.inc
index 9eb43407ea..4b0b4f4c2f 100644
--- a/tcg/arm/tcg-target.c.inc
+++ b/tcg/arm/tcg-target.c.inc
@@ -34,13 +34,6 @@ bool use_idiv_instructions;
 bool use_neon_instructions;
 #endif
 
-/* ??? Ought to think about changing CONFIG_SOFTMMU to always defined.  */
-#ifdef CONFIG_SOFTMMU
-# define USING_SOFTMMU 1
-#else
-# define USING_SOFTMMU 0
-#endif
-
 #ifdef CONFIG_DEBUG_TCG
 static const char * const tcg_target_reg_names[TCG_TARGET_NB_REGS] = {
 "%r0",  "%r1",  "%r2",  "%r3",  "%r4",  "%r5",  "%r6",  "%r7",
@@ -1621,8 +1614,8 @@ static void tcg_out_qemu_ld_index(TCGContext *s, MemOp 
opc,
 tcg_out_ld32_r(s, COND_AL, datalo, addrlo, addend);
 break;
 case MO_UQ:
-/* Avoid ldrd for user-only emulation, to handle unaligned.  */
-if (USING_SOFTMMU
+/* LDRD requires alignment; double-check that. */
+if (get_alignment_bits(opc) >= MO_64
 && (datalo & 1) == 0 && datahi == datalo + 1) {
 tcg_out_ldrd_r(s, COND_AL, datalo, addrlo, addend);
 } else if (datalo != addend) {
@@ -1664,8 +1657,8 @@ static void tcg_out_qemu_ld_direct(TCGContext *s, MemOp 
opc, TCGReg datalo,
 tcg_out_ld32_12(s, COND_AL, datalo, addrlo, 0);
 break;
 case MO_UQ:
-/* Avoid ldrd for user-only emulation, to handle unaligned.  */
-if (USING_SOFTMMU
+/* LDRD requires alignment; double-check that. */
+if (get_alignment_bits(opc) >= MO_64
 && (datalo & 1) == 0 && datahi == datalo + 1) {
 tcg_out_ldrd_8(s, COND_AL, datalo, addrlo, 0);
 } else if (datalo == addrlo) {
@@ -1741,8 +1734,8 @@ static void tcg_out_qemu_st_index(TCGContext *s, ARMCond 
cond, MemOp opc,
 tcg_out_st32_r(s, cond, datalo, addrlo, addend);
 break;
 case MO_64:
-/* Avoid strd for user-only emulation, to handle unaligned.  */
-if (USING_SOFTMMU
+/* STRD requires alignment; double-check that. */
+if (get_alignment_bits(opc) >= MO_64
 && (datalo & 1) == 0 && datahi == datalo + 1) {
 tcg_out_strd_r(s, cond, datalo, addrlo, addend);
 } else {
@@ -1773,8 +1766,8 @@ static void tcg_out_qemu_st_direct(TCGContext *s, MemOp 
opc, TCGReg datalo,
 tcg_out_st32_12(s, COND_AL, datalo, addrlo, 0);
 break;
 case MO_64:
-/* Avoid strd for user-only emulation, to handle unaligned.  */
-if (USING_SOFTMMU
+/* STRD requires alignment; double-check that. */
+if (get_alignment_bits(opc) >= MO_64
 && (datalo & 1) == 0 && datahi == datalo + 1) {
 tcg_out_strd_8(s, COND_AL, datalo, addrlo, 0);
 } else {
-- 
2.25.1




[PULL 29/34] tcg/sparc: Improve code gen for shifted 32-bit constants

2022-02-10 Thread Richard Henderson
We had code for checking for 13 and 21-bit shifted constants,
but we can do better and allow 32-bit shifted constants.
This is still 2 insns shorter than the full 64-bit sequence.
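
The qualifying test amounts to the following (a sketch; ctz64 is modeled
by the GCC builtin, and arg is assumed non-zero because earlier cases
already handled the small constants):

    #include <stdint.h>

    /* A constant fits the "32-bit movi, then sllx" pattern when
     * stripping its trailing zeros leaves a value representable in
     * 32 bits, zero- or sign-extended. */
    static int is_shifted_32bit(uint64_t arg, unsigned *lsb, int64_t *test)
    {
        *lsb = __builtin_ctzll(arg);        /* ctz64(arg), arg != 0 */
        *test = (int64_t)arg >> *lsb;
        return *test == (uint32_t)*test || *test == (int32_t)*test;
    }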

Reviewed-by: Peter Maydell 
Reviewed-by: Philippe Mathieu-Daudé 
Signed-off-by: Richard Henderson 
---
 tcg/sparc/tcg-target.c.inc | 12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/tcg/sparc/tcg-target.c.inc b/tcg/sparc/tcg-target.c.inc
index 7a8f20ee9a..ed2f4ecc40 100644
--- a/tcg/sparc/tcg-target.c.inc
+++ b/tcg/sparc/tcg-target.c.inc
@@ -462,17 +462,17 @@ static void tcg_out_movi_int(TCGContext *s, TCGType type, 
TCGReg ret,
 return;
 }
 
-/* A 21-bit constant, shifted.  */
+/* A 32-bit constant, shifted.  */
 lsb = ctz64(arg);
 test = (tcg_target_long)arg >> lsb;
-if (check_fit_tl(test, 13)) {
-tcg_out_movi_imm13(s, ret, test);
-tcg_out_arithi(s, ret, ret, lsb, SHIFT_SLLX);
-return;
-} else if (lsb > 10 && test == extract64(test, 0, 21)) {
+if (lsb > 10 && test == extract64(test, 0, 21)) {
 tcg_out_sethi(s, ret, test << 10);
 tcg_out_arithi(s, ret, ret, lsb - 10, SHIFT_SLLX);
 return;
+} else if (test == (uint32_t)test || test == (int32_t)test) {
+tcg_out_movi_int(s, TCG_TYPE_I64, ret, test, in_prologue, scratch);
+tcg_out_arithi(s, ret, ret, lsb, SHIFT_SLLX);
+return;
 }
 
 /* A 64-bit constant decomposed into 2 32-bit pieces.  */
-- 
2.25.1




[PULL 10/34] tcg/i386: Support raising sigbus for user-only

2022-02-10 Thread Richard Henderson
Reviewed-by: Peter Maydell 
Signed-off-by: Richard Henderson 
---
 tcg/i386/tcg-target.h |   2 -
 tcg/i386/tcg-target.c.inc | 103 --
 2 files changed, 98 insertions(+), 7 deletions(-)

diff --git a/tcg/i386/tcg-target.h b/tcg/i386/tcg-target.h
index b00a6da293..3b2c9437a0 100644
--- a/tcg/i386/tcg-target.h
+++ b/tcg/i386/tcg-target.h
@@ -232,9 +232,7 @@ static inline void tb_target_set_jmp_target(uintptr_t 
tc_ptr, uintptr_t jmp_rx,
 
 #define TCG_TARGET_HAS_MEMORY_BSWAP  have_movbe
 
-#ifdef CONFIG_SOFTMMU
 #define TCG_TARGET_NEED_LDST_LABELS
-#endif
 #define TCG_TARGET_NEED_POOL_LABELS
 
 #endif
diff --git a/tcg/i386/tcg-target.c.inc b/tcg/i386/tcg-target.c.inc
index 4dab09f265..faa15eecab 100644
--- a/tcg/i386/tcg-target.c.inc
+++ b/tcg/i386/tcg-target.c.inc
@@ -22,6 +22,7 @@
  * THE SOFTWARE.
  */
 
+#include "../tcg-ldst.c.inc"
 #include "../tcg-pool.c.inc"
 
 #ifdef CONFIG_DEBUG_TCG
@@ -421,8 +422,9 @@ static bool tcg_target_const_match(int64_t val, TCGType 
type, int ct)
 #define OPC_VZEROUPPER  (0x77 | P_EXT)
 #define OPC_XCHG_ax_r32(0x90)
 
-#define OPC_GRP3_Ev(0xf7)
-#define OPC_GRP5   (0xff)
+#define OPC_GRP3_Eb (0xf6)
+#define OPC_GRP3_Ev (0xf7)
+#define OPC_GRP5    (0xff)
 #define OPC_GRP14   (0x73 | P_EXT | P_DATA16)
 
 /* Group 1 opcode extensions for 0x80-0x83.
@@ -444,6 +446,7 @@ static bool tcg_target_const_match(int64_t val, TCGType 
type, int ct)
 #define SHIFT_SAR 7
 
 /* Group 3 opcode extensions for 0xf6, 0xf7.  To be used with OPC_GRP3.  */
+#define EXT3_TESTi 0
 #define EXT3_NOT   2
 #define EXT3_NEG   3
 #define EXT3_MUL   4
@@ -1606,8 +1609,6 @@ static void tcg_out_nopn(TCGContext *s, int n)
 }
 
 #if defined(CONFIG_SOFTMMU)
-#include "../tcg-ldst.c.inc"
-
 /* helper signature: helper_ret_ld_mmu(CPUState *env, target_ulong addr,
  * int mmu_idx, uintptr_t ra)
  */
@@ -1916,7 +1917,84 @@ static bool tcg_out_qemu_st_slow_path(TCGContext *s, 
TCGLabelQemuLdst *l)
 tcg_out_jmp(s, qemu_st_helpers[opc & (MO_BSWAP | MO_SIZE)]);
 return true;
 }
-#elif TCG_TARGET_REG_BITS == 32
+#else
+
+static void tcg_out_test_alignment(TCGContext *s, bool is_ld, TCGReg addrlo,
+   TCGReg addrhi, unsigned a_bits)
+{
+unsigned a_mask = (1 << a_bits) - 1;
+TCGLabelQemuLdst *label;
+
+/*
+ * We are expecting a_bits to max out at 7, so we can usually use testb.
+ * For i686, we have to use testl for %esi/%edi.
+ */
+if (a_mask <= 0xff && (TCG_TARGET_REG_BITS == 64 || addrlo < 4)) {
+tcg_out_modrm(s, OPC_GRP3_Eb | P_REXB_RM, EXT3_TESTi, addrlo);
+tcg_out8(s, a_mask);
+} else {
+tcg_out_modrm(s, OPC_GRP3_Ev, EXT3_TESTi, addrlo);
+tcg_out32(s, a_mask);
+}
+
+/* jne slow_path */
+tcg_out_opc(s, OPC_JCC_long + JCC_JNE, 0, 0, 0);
+
+label = new_ldst_label(s);
+label->is_ld = is_ld;
+label->addrlo_reg = addrlo;
+label->addrhi_reg = addrhi;
+label->raddr = tcg_splitwx_to_rx(s->code_ptr + 4);
+label->label_ptr[0] = s->code_ptr;
+
+s->code_ptr += 4;
+}
+
+static bool tcg_out_fail_alignment(TCGContext *s, TCGLabelQemuLdst *l)
+{
+/* resolve label address */
+tcg_patch32(l->label_ptr[0], s->code_ptr - l->label_ptr[0] - 4);
+
+if (TCG_TARGET_REG_BITS == 32) {
+int ofs = 0;
+
+tcg_out_st(s, TCG_TYPE_PTR, TCG_AREG0, TCG_REG_ESP, ofs);
+ofs += 4;
+
+tcg_out_st(s, TCG_TYPE_I32, l->addrlo_reg, TCG_REG_ESP, ofs);
+ofs += 4;
+if (TARGET_LONG_BITS == 64) {
+tcg_out_st(s, TCG_TYPE_I32, l->addrhi_reg, TCG_REG_ESP, ofs);
+ofs += 4;
+}
+
+tcg_out_pushi(s, (uintptr_t)l->raddr);
+} else {
+tcg_out_mov(s, TCG_TYPE_TL, tcg_target_call_iarg_regs[1],
+l->addrlo_reg);
+tcg_out_mov(s, TCG_TYPE_PTR, tcg_target_call_iarg_regs[0], TCG_AREG0);
+
+tcg_out_movi(s, TCG_TYPE_PTR, TCG_REG_RAX, (uintptr_t)l->raddr);
+tcg_out_push(s, TCG_REG_RAX);
+}
+
+/* "Tail call" to the helper, with the return address back inline. */
+tcg_out_jmp(s, (const void *)(l->is_ld ? helper_unaligned_ld
+  : helper_unaligned_st));
+return true;
+}
+
+static bool tcg_out_qemu_ld_slow_path(TCGContext *s, TCGLabelQemuLdst *l)
+{
+return tcg_out_fail_alignment(s, l);
+}
+
+static bool tcg_out_qemu_st_slow_path(TCGContext *s, TCGLabelQemuLdst *l)
+{
+return tcg_out_fail_alignment(s, l);
+}
+
+#if TCG_TARGET_REG_BITS == 32
 # define x86_guest_base_seg 0
 # define x86_guest_base_index   -1
 # define x86_guest_base_offset  guest_base
@@ -1950,6 +2028,7 @@ static inline int setup_guest_base_seg(void)
 return 0;
 }
 # endif
+#endif
 #endif /* SOFTMMU */
 
 static void tcg_out_qemu_ld_direct(TCGContext *s, TCGReg datalo, TCGReg datahi,
@@ -2059,6 +2138,8 @@ static void tcg_out_qemu_ld(TCGContext *s, const TCGArg 
*arg

[PULL 07/34] softmmu/cpus: Check if the cpu work list is empty atomically

2022-02-10 Thread Richard Henderson
From: Idan Horowitz 

Instead of taking the lock of the cpu work list in order to check if it's
empty, we can just read the head pointer atomically. This decreases
cpu_work_list_empty's share from 5% to 1.3% in a profile of icount-enabled
aarch64-softmmu.
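
The pattern, sketched with C11 atomics instead of QEMU's qatomic/QSIMPLEQ
macros (names illustrative): an emptiness probe needs only an atomic load
of the head pointer, since the answer is advisory either way -- the list
may change immediately after the check, lock or no lock.

    #include <stdatomic.h>
    #include <stddef.h>

    struct work_item { struct work_item *next; };

    static _Atomic(struct work_item *) work_head;

    static int work_list_empty(void)
    {
        /* A relaxed load suffices for a hint; no mutex required. */
        return atomic_load_explicit(&work_head, memory_order_relaxed) == NULL;
    }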

Signed-off-by: Idan Horowitz 
Message-Id: <20220114004358.299534-1-idan.horow...@gmail.com>
Reviewed-by: Richard Henderson 
Signed-off-by: Richard Henderson 
---
 softmmu/cpus.c | 7 +--
 1 file changed, 1 insertion(+), 6 deletions(-)

diff --git a/softmmu/cpus.c b/softmmu/cpus.c
index 23bca46b07..035395ae13 100644
--- a/softmmu/cpus.c
+++ b/softmmu/cpus.c
@@ -73,12 +73,7 @@ bool cpu_is_stopped(CPUState *cpu)
 
 bool cpu_work_list_empty(CPUState *cpu)
 {
-bool ret;
-
-qemu_mutex_lock(&cpu->work_mutex);
-ret = QSIMPLEQ_EMPTY(&cpu->work_list);
-qemu_mutex_unlock(&cpu->work_mutex);
-return ret;
+return QSIMPLEQ_EMPTY_ATOMIC(&cpu->work_list);
 }
 
 bool cpu_thread_is_idle(CPUState *cpu)
-- 
2.25.1




[PULL 21/34] tcg/arm: Support unaligned access for softmmu

2022-02-10 Thread Richard Henderson
From armv6, the architecture supports unaligned accesses,
All we need to do is perform the correct alignment check
in tcg_out_tlb_read.
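
The check itself reduces to a small computation, sketched here in plain C
(assuming 4 KiB pages; not the emitted code): for an access of size
1 << s_bits with required alignment 1 << a_bits, bump the address by
s_mask - a_mask before masking, so a page-crossing access compares
against the wrong page and drops to the slow path.

    #include <stdint.h>

    #define PAGE_BITS 12                                   /* assumed */
    #define PAGE_MASK (~(((uint64_t)1 << PAGE_BITS) - 1))

    /* Value compared against the TLB entry; the low alignment bits are
     * kept and must be zero, which folds the alignment test into the
     * same comparison. */
    static uint64_t tlb_compare_value(uint64_t addr, unsigned s_bits,
                                      unsigned a_bits)
    {
        uint64_t s_mask = ((uint64_t)1 << s_bits) - 1;
        uint64_t a_mask = ((uint64_t)1 << a_bits) - 1;
        return (addr + (s_mask - a_mask)) & (PAGE_MASK | a_mask);
    }

E.g. an unaligned 4-byte load at 0xffd becomes 0x1000 after the addition
and no longer matches a TLB entry for page 0.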

Reviewed-by: Peter Maydell 
Signed-off-by: Richard Henderson 
---
 tcg/arm/tcg-target.c.inc | 41 
 1 file changed, 21 insertions(+), 20 deletions(-)

diff --git a/tcg/arm/tcg-target.c.inc b/tcg/arm/tcg-target.c.inc
index 4b0b4f4c2f..d290b4556c 100644
--- a/tcg/arm/tcg-target.c.inc
+++ b/tcg/arm/tcg-target.c.inc
@@ -1396,16 +1396,9 @@ static TCGReg tcg_out_tlb_read(TCGContext *s, TCGReg 
addrlo, TCGReg addrhi,
 int cmp_off = (is_load ? offsetof(CPUTLBEntry, addr_read)
: offsetof(CPUTLBEntry, addr_write));
 int fast_off = TLB_MASK_TABLE_OFS(mem_index);
-unsigned s_bits = opc & MO_SIZE;
-unsigned a_bits = get_alignment_bits(opc);
-
-/*
- * We don't support inline unaligned acceses, but we can easily
- * support overalignment checks.
- */
-if (a_bits < s_bits) {
-a_bits = s_bits;
-}
+unsigned s_mask = (1 << (opc & MO_SIZE)) - 1;
+unsigned a_mask = (1 << get_alignment_bits(opc)) - 1;
+TCGReg t_addr;
 
 /* Load env_tlb(env)->f[mmu_idx].{mask,table} into {r0,r1}.  */
 tcg_out_ldrd_8(s, COND_AL, TCG_REG_R0, TCG_AREG0, fast_off);
@@ -1440,27 +1433,35 @@ static TCGReg tcg_out_tlb_read(TCGContext *s, TCGReg 
addrlo, TCGReg addrhi,
 
 /*
  * Check alignment, check comparators.
- * Do this in no more than 3 insns.  Use MOVW for v7, if possible,
+ * Do this in 2-4 insns.  Use MOVW for v7, if possible,
  * to reduce the number of sequential conditional instructions.
  * Almost all guests have at least 4k pages, which means that we need
  * to clear at least 9 bits even for an 8-byte memory, which means it
  * isn't worth checking for an immediate operand for BIC.
+ *
+ * For unaligned accesses, test the page of the last unit of alignment.
+ * This leaves the least significant alignment bits unchanged, and of
+ * course must be zero.
  */
+t_addr = addrlo;
+if (a_mask < s_mask) {
+t_addr = TCG_REG_R0;
+tcg_out_dat_imm(s, COND_AL, ARITH_ADD, t_addr,
+addrlo, s_mask - a_mask);
+}
 if (use_armv7_instructions && TARGET_PAGE_BITS <= 16) {
-tcg_target_ulong mask = ~(TARGET_PAGE_MASK | ((1 << a_bits) - 1));
-
-tcg_out_movi32(s, COND_AL, TCG_REG_TMP, mask);
+tcg_out_movi32(s, COND_AL, TCG_REG_TMP, ~(TARGET_PAGE_MASK | a_mask));
 tcg_out_dat_reg(s, COND_AL, ARITH_BIC, TCG_REG_TMP,
-addrlo, TCG_REG_TMP, 0);
+t_addr, TCG_REG_TMP, 0);
 tcg_out_dat_reg(s, COND_AL, ARITH_CMP, 0, TCG_REG_R2, TCG_REG_TMP, 0);
 } else {
-if (a_bits) {
-tcg_out_dat_imm(s, COND_AL, ARITH_TST, 0, addrlo,
-(1 << a_bits) - 1);
+if (a_mask) {
+tcg_debug_assert(a_mask <= 0xff);
+tcg_out_dat_imm(s, COND_AL, ARITH_TST, 0, addrlo, a_mask);
 }
-tcg_out_dat_reg(s, COND_AL, ARITH_MOV, TCG_REG_TMP, 0, addrlo,
+tcg_out_dat_reg(s, COND_AL, ARITH_MOV, TCG_REG_TMP, 0, t_addr,
 SHIFT_IMM_LSR(TARGET_PAGE_BITS));
-tcg_out_dat_reg(s, (a_bits ? COND_EQ : COND_AL), ARITH_CMP,
+tcg_out_dat_reg(s, (a_mask ? COND_EQ : COND_AL), ARITH_CMP,
 0, TCG_REG_R2, TCG_REG_TMP,
 SHIFT_IMM_LSL(TARGET_PAGE_BITS));
 }
-- 
2.25.1




[PULL 24/34] tcg/mips: Support unaligned access for user-only

2022-02-10 Thread Richard Henderson
This is kinda sorta the opposite of the other tcg hosts, where
we get (normal) alignment checks for free with host SIGBUS and
need to add code to support unaligned accesses.

Fortunately, the ISA contains pairs of instructions that are
used to implement unaligned memory accesses.  Use them.
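
What the pairs buy, in portable C (a sketch, assuming the compiler lowers
the memcpy to lwl/lwr -- the ulw pseudo-instruction -- on pre-R6 MIPS):

    #include <stdint.h>
    #include <string.h>

    /* lwl fetches the "left" bytes and lwr the "right" bytes of a word
     * that may straddle an alignment boundary; the pair behaves like
     * this byte-wise load and never takes an alignment trap. */
    static uint32_t load_u32_unaligned(const void *p)
    {
        uint32_t v;
        memcpy(&v, p, sizeof(v));
        return v;
    }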

Tested-by: Jiaxun Yang 
Reviewed-by: Jiaxun Yang 
Reviewed-by: Philippe Mathieu-Daudé 
Signed-off-by: Richard Henderson 
---
 tcg/mips/tcg-target.h |   2 -
 tcg/mips/tcg-target.c.inc | 334 +-
 2 files changed, 328 insertions(+), 8 deletions(-)

diff --git a/tcg/mips/tcg-target.h b/tcg/mips/tcg-target.h
index c366fdf74b..7669213175 100644
--- a/tcg/mips/tcg-target.h
+++ b/tcg/mips/tcg-target.h
@@ -207,8 +207,6 @@ extern bool use_mips32r2_instructions;
 void tb_target_set_jmp_target(uintptr_t, uintptr_t, uintptr_t, uintptr_t)
 QEMU_ERROR("code path is reachable");
 
-#ifdef CONFIG_SOFTMMU
 #define TCG_TARGET_NEED_LDST_LABELS
-#endif
 
 #endif
diff --git a/tcg/mips/tcg-target.c.inc b/tcg/mips/tcg-target.c.inc
index 27b020e66c..2c94ac2ed6 100644
--- a/tcg/mips/tcg-target.c.inc
+++ b/tcg/mips/tcg-target.c.inc
@@ -24,6 +24,8 @@
  * THE SOFTWARE.
  */
 
+#include "../tcg-ldst.c.inc"
+
 #ifdef HOST_WORDS_BIGENDIAN
 # define MIPS_BE  1
 #else
@@ -230,16 +232,26 @@ typedef enum {
 OPC_ORI  = 015 << 26,
 OPC_XORI = 016 << 26,
 OPC_LUI  = 017 << 26,
+OPC_BNEL = 025 << 26,
+OPC_BNEZALC_R6 = 030 << 26,
 OPC_DADDIU   = 031 << 26,
+OPC_LDL  = 032 << 26,
+OPC_LDR  = 033 << 26,
 OPC_LB   = 040 << 26,
 OPC_LH   = 041 << 26,
+OPC_LWL  = 042 << 26,
 OPC_LW   = 043 << 26,
 OPC_LBU  = 044 << 26,
 OPC_LHU  = 045 << 26,
+OPC_LWR  = 046 << 26,
 OPC_LWU  = 047 << 26,
 OPC_SB   = 050 << 26,
 OPC_SH   = 051 << 26,
+OPC_SWL  = 052 << 26,
 OPC_SW   = 053 << 26,
+OPC_SDL  = 054 << 26,
+OPC_SDR  = 055 << 26,
+OPC_SWR  = 056 << 26,
 OPC_LD   = 067 << 26,
 OPC_SD   = 077 << 26,
 
@@ -1015,8 +1027,6 @@ static void tcg_out_call(TCGContext *s, const 
tcg_insn_unit *arg)
 }
 
 #if defined(CONFIG_SOFTMMU)
-#include "../tcg-ldst.c.inc"
-
 static void * const qemu_ld_helpers[(MO_SSIZE | MO_BSWAP) + 1] = {
 [MO_UB]   = helper_ret_ldub_mmu,
 [MO_SB]   = helper_ret_ldsb_mmu,
@@ -1324,7 +1334,82 @@ static bool tcg_out_qemu_st_slow_path(TCGContext *s, 
TCGLabelQemuLdst *l)
 tcg_out_mov(s, TCG_TYPE_PTR, tcg_target_call_iarg_regs[0], TCG_AREG0);
 return true;
 }
-#endif
+
+#else
+
+static void tcg_out_test_alignment(TCGContext *s, bool is_ld, TCGReg addrlo,
+   TCGReg addrhi, unsigned a_bits)
+{
+unsigned a_mask = (1 << a_bits) - 1;
+TCGLabelQemuLdst *l = new_ldst_label(s);
+
+l->is_ld = is_ld;
+l->addrlo_reg = addrlo;
+l->addrhi_reg = addrhi;
+
+/* We are expecting a_bits to max out at 7, much lower than ANDI. */
+tcg_debug_assert(a_bits < 16);
+tcg_out_opc_imm(s, OPC_ANDI, TCG_TMP0, addrlo, a_mask);
+
+l->label_ptr[0] = s->code_ptr;
+if (use_mips32r6_instructions) {
+tcg_out_opc_br(s, OPC_BNEZALC_R6, TCG_REG_ZERO, TCG_TMP0);
+} else {
+tcg_out_opc_br(s, OPC_BNEL, TCG_TMP0, TCG_REG_ZERO);
+tcg_out_nop(s);
+}
+
+l->raddr = tcg_splitwx_to_rx(s->code_ptr);
+}
+
+static bool tcg_out_fail_alignment(TCGContext *s, TCGLabelQemuLdst *l)
+{
+void *target;
+
+if (!reloc_pc16(l->label_ptr[0], tcg_splitwx_to_rx(s->code_ptr))) {
+return false;
+}
+
+if (TCG_TARGET_REG_BITS < TARGET_LONG_BITS) {
+/* A0 is env, A1 is skipped, A2:A3 is the uint64_t address. */
+TCGReg a2 = MIPS_BE ? l->addrhi_reg : l->addrlo_reg;
+TCGReg a3 = MIPS_BE ? l->addrlo_reg : l->addrhi_reg;
+
+if (a3 != TCG_REG_A2) {
+tcg_out_mov(s, TCG_TYPE_I32, TCG_REG_A2, a2);
+tcg_out_mov(s, TCG_TYPE_I32, TCG_REG_A3, a3);
+} else if (a2 != TCG_REG_A3) {
+tcg_out_mov(s, TCG_TYPE_I32, TCG_REG_A3, a3);
+tcg_out_mov(s, TCG_TYPE_I32, TCG_REG_A2, a2);
+} else {
+tcg_out_mov(s, TCG_TYPE_I32, TCG_TMP0, TCG_REG_A2);
+tcg_out_mov(s, TCG_TYPE_I32, TCG_REG_A2, TCG_REG_A3);
+tcg_out_mov(s, TCG_TYPE_I32, TCG_REG_A3, TCG_TMP0);
+}
+} else {
+tcg_out_mov(s, TCG_TYPE_TL, TCG_REG_A1, l->addrlo_reg);
+}
+tcg_out_mov(s, TCG_TYPE_PTR, TCG_REG_A0, TCG_AREG0);
+
+/*
+ * Tail call to the helper, with the return address back inline.
+ * We have arrived here via BNEL, so $31 is already set.
+ */
+target = (l->is_ld ? helper_unaligned_ld : helper_unaligned_st);
+tcg_out_call_int(s, target, true);
+return true;
+}
+
+static bool tcg_out_qemu_ld_slow_path(TCGContext *s, TCGLabelQemuLdst *l)
+{
+return tcg_out_fail_alignment(s, l);
+}
+
+static bool tcg_out_qemu_st_slow_path(TCGContext *s, TCGLabelQemuLdst *l)
+{
+return tcg_out_fail_alignment(s, l);
+}

[PULL 15/34] tcg/tci: Support raising sigbus for user-only

2022-02-10 Thread Richard Henderson
Reviewed-by: Philippe Mathieu-Daudé 
Signed-off-by: Richard Henderson 
---
 tcg/tci.c | 20 ++--
 1 file changed, 14 insertions(+), 6 deletions(-)

diff --git a/tcg/tci.c b/tcg/tci.c
index 336af5945a..fe92b5d084 100644
--- a/tcg/tci.c
+++ b/tcg/tci.c
@@ -292,11 +292,11 @@ static bool tci_compare64(uint64_t u0, uint64_t u1, 
TCGCond condition)
 static uint64_t tci_qemu_ld(CPUArchState *env, target_ulong taddr,
 MemOpIdx oi, const void *tb_ptr)
 {
-MemOp mop = get_memop(oi) & (MO_BSWAP | MO_SSIZE);
+MemOp mop = get_memop(oi);
 uintptr_t ra = (uintptr_t)tb_ptr;
 
 #ifdef CONFIG_SOFTMMU
-switch (mop) {
+switch (mop & (MO_BSWAP | MO_SSIZE)) {
 case MO_UB:
 return helper_ret_ldub_mmu(env, taddr, oi, ra);
 case MO_SB:
@@ -326,10 +326,14 @@ static uint64_t tci_qemu_ld(CPUArchState *env, 
target_ulong taddr,
 }
 #else
 void *haddr = g2h(env_cpu(env), taddr);
+unsigned a_mask = (1u << get_alignment_bits(mop)) - 1;
 uint64_t ret;
 
 set_helper_retaddr(ra);
-switch (mop) {
+if (taddr & a_mask) {
+helper_unaligned_ld(env, taddr);
+}
+switch (mop & (MO_BSWAP | MO_SSIZE)) {
 case MO_UB:
 ret = ldub_p(haddr);
 break;
@@ -377,11 +381,11 @@ static uint64_t tci_qemu_ld(CPUArchState *env, 
target_ulong taddr,
 static void tci_qemu_st(CPUArchState *env, target_ulong taddr, uint64_t val,
 MemOpIdx oi, const void *tb_ptr)
 {
-MemOp mop = get_memop(oi) & (MO_BSWAP | MO_SSIZE);
+MemOp mop = get_memop(oi);
 uintptr_t ra = (uintptr_t)tb_ptr;
 
 #ifdef CONFIG_SOFTMMU
-switch (mop) {
+switch (mop & (MO_BSWAP | MO_SIZE)) {
 case MO_UB:
 helper_ret_stb_mmu(env, taddr, val, oi, ra);
 break;
@@ -408,9 +412,13 @@ static void tci_qemu_st(CPUArchState *env, target_ulong 
taddr, uint64_t val,
 }
 #else
 void *haddr = g2h(env_cpu(env), taddr);
+unsigned a_mask = (1u << get_alignment_bits(mop)) - 1;
 
 set_helper_retaddr(ra);
-switch (mop) {
+if (taddr & a_mask) {
+helper_unaligned_st(env, taddr);
+}
+switch (mop & (MO_BSWAP | MO_SIZE)) {
 case MO_UB:
 stb_p(haddr, val);
 break;
-- 
2.25.1




[PULL 19/34] tcg/arm: Remove use_armv6_instructions

2022-02-10 Thread Richard Henderson
This is now always true, since we require armv6.
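
For reference, the hard-coded patterns that replace the conditionals
decompose like this (a sketch of the A32 encodings with rotation 0;
verified against the constants in the diff, everything else is
illustrative):

    #include <stdint.h>

    /* SXTB{cond} rd, rn -> cccc 0110 1010 1111 dddd 0000 0111 nnnn */
    static uint32_t encode_sxtb(uint32_t cond, uint32_t rd, uint32_t rn)
    {
        return 0x06af0070u | (cond << 28) | (rd << 12) | rn;
    }
    /* sxth uses base 0x06bf0070 and uxth 0x06ff0070, same field layout. */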

Reviewed-by: Peter Maydell 
Signed-off-by: Richard Henderson 
---
 tcg/arm/tcg-target.h |   1 -
 tcg/arm/tcg-target.c.inc | 192 ++-
 2 files changed, 27 insertions(+), 166 deletions(-)

diff --git a/tcg/arm/tcg-target.h b/tcg/arm/tcg-target.h
index 5c9ba5feea..1dd4cd5377 100644
--- a/tcg/arm/tcg-target.h
+++ b/tcg/arm/tcg-target.h
@@ -28,7 +28,6 @@
 
 extern int arm_arch;
 
-#define use_armv6_instructions  (__ARM_ARCH >= 6 || arm_arch >= 6)
 #define use_armv7_instructions  (__ARM_ARCH >= 7 || arm_arch >= 7)
 
 #undef TCG_TARGET_STACK_GROWSUP
diff --git a/tcg/arm/tcg-target.c.inc b/tcg/arm/tcg-target.c.inc
index f3b635063f..9eb43407ea 100644
--- a/tcg/arm/tcg-target.c.inc
+++ b/tcg/arm/tcg-target.c.inc
@@ -923,17 +923,6 @@ static void tcg_out_dat_rIN(TCGContext *s, ARMCond cond, 
ARMInsn opc,
 static void tcg_out_mul32(TCGContext *s, ARMCond cond, TCGReg rd,
   TCGReg rn, TCGReg rm)
 {
-/* if ArchVersion() < 6 && d == n then UNPREDICTABLE;  */
-if (!use_armv6_instructions && rd == rn) {
-if (rd == rm) {
-/* rd == rn == rm; copy an input to tmp first.  */
-tcg_out_mov_reg(s, cond, TCG_REG_TMP, rn);
-rm = rn = TCG_REG_TMP;
-} else {
-rn = rm;
-rm = rd;
-}
-}
 /* mul */
 tcg_out32(s, (cond << 28) | 0x90 | (rd << 16) | (rm << 8) | rn);
 }
@@ -941,17 +930,6 @@ static void tcg_out_mul32(TCGContext *s, ARMCond cond, 
TCGReg rd,
 static void tcg_out_umull32(TCGContext *s, ARMCond cond, TCGReg rd0,
 TCGReg rd1, TCGReg rn, TCGReg rm)
 {
-/* if ArchVersion() < 6 && (dHi == n || dLo == n) then UNPREDICTABLE;  */
-if (!use_armv6_instructions && (rd0 == rn || rd1 == rn)) {
-if (rd0 == rm || rd1 == rm) {
-tcg_out_mov_reg(s, cond, TCG_REG_TMP, rn);
-rn = TCG_REG_TMP;
-} else {
-TCGReg t = rn;
-rn = rm;
-rm = t;
-}
-}
 /* umull */
 tcg_out32(s, (cond << 28) | 0x00800090 |
   (rd1 << 16) | (rd0 << 12) | (rm << 8) | rn);
@@ -960,17 +938,6 @@ static void tcg_out_umull32(TCGContext *s, ARMCond cond, 
TCGReg rd0,
 static void tcg_out_smull32(TCGContext *s, ARMCond cond, TCGReg rd0,
 TCGReg rd1, TCGReg rn, TCGReg rm)
 {
-/* if ArchVersion() < 6 && (dHi == n || dLo == n) then UNPREDICTABLE;  */
-if (!use_armv6_instructions && (rd0 == rn || rd1 == rn)) {
-if (rd0 == rm || rd1 == rm) {
-tcg_out_mov_reg(s, cond, TCG_REG_TMP, rn);
-rn = TCG_REG_TMP;
-} else {
-TCGReg t = rn;
-rn = rm;
-rm = t;
-}
-}
 /* smull */
 tcg_out32(s, (cond << 28) | 0x00c00090 |
   (rd1 << 16) | (rd0 << 12) | (rm << 8) | rn);
@@ -990,15 +957,8 @@ static void tcg_out_udiv(TCGContext *s, ARMCond cond,
 
 static void tcg_out_ext8s(TCGContext *s, ARMCond cond, TCGReg rd, TCGReg rn)
 {
-if (use_armv6_instructions) {
-/* sxtb */
-tcg_out32(s, 0x06af0070 | (cond << 28) | (rd << 12) | rn);
-} else {
-tcg_out_dat_reg(s, cond, ARITH_MOV,
-rd, 0, rn, SHIFT_IMM_LSL(24));
-tcg_out_dat_reg(s, cond, ARITH_MOV,
-rd, 0, rd, SHIFT_IMM_ASR(24));
-}
+/* sxtb */
+tcg_out32(s, 0x06af0070 | (cond << 28) | (rd << 12) | rn);
 }
 
 static void __attribute__((unused))
@@ -1009,113 +969,37 @@ tcg_out_ext8u(TCGContext *s, ARMCond cond, TCGReg rd, 
TCGReg rn)
 
 static void tcg_out_ext16s(TCGContext *s, ARMCond cond, TCGReg rd, TCGReg rn)
 {
-if (use_armv6_instructions) {
-/* sxth */
-tcg_out32(s, 0x06bf0070 | (cond << 28) | (rd << 12) | rn);
-} else {
-tcg_out_dat_reg(s, cond, ARITH_MOV,
-rd, 0, rn, SHIFT_IMM_LSL(16));
-tcg_out_dat_reg(s, cond, ARITH_MOV,
-rd, 0, rd, SHIFT_IMM_ASR(16));
-}
+/* sxth */
+tcg_out32(s, 0x06bf0070 | (cond << 28) | (rd << 12) | rn);
 }
 
 static void tcg_out_ext16u(TCGContext *s, ARMCond cond, TCGReg rd, TCGReg rn)
 {
-if (use_armv6_instructions) {
-/* uxth */
-tcg_out32(s, 0x06ff0070 | (cond << 28) | (rd << 12) | rn);
-} else {
-tcg_out_dat_reg(s, cond, ARITH_MOV,
-rd, 0, rn, SHIFT_IMM_LSL(16));
-tcg_out_dat_reg(s, cond, ARITH_MOV,
-rd, 0, rd, SHIFT_IMM_LSR(16));
-}
+/* uxth */
+tcg_out32(s, 0x06ff0070 | (cond << 28) | (rd << 12) | rn);
 }
 
 static void tcg_out_bswap16(TCGContext *s, ARMCond cond,
 TCGReg rd, TCGReg rn, int flags)
 {
-if (use_armv6_instructions) {
-if (flags & TCG_BSWAP_OS) {
-/* revsh */
-tcg_out32(s, 0x06ff0fb0 | (cond << 28) | (rd << 12) | rn);
-return;
-}
-
-   

[PULL 09/34] tcg/loongarch64: Fix fallout from recent MO_Q renaming

2022-02-10 Thread Richard Henderson
From: WANG Xuerui 

Apparently we were left behind; just renaming MO_Q to MO_UQ is enough.

Fixes: fc313c64345453c7 ("exec/memop: Adding signedness to quad definitions")
Signed-off-by: WANG Xuerui 
Message-Id: <20220206162106.1092364-1-i.q...@xen0n.name>
Signed-off-by: Richard Henderson 
---
 tcg/loongarch64/tcg-target.c.inc | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tcg/loongarch64/tcg-target.c.inc b/tcg/loongarch64/tcg-target.c.inc
index 9cd46c9be3..d31a0e5991 100644
--- a/tcg/loongarch64/tcg-target.c.inc
+++ b/tcg/loongarch64/tcg-target.c.inc
@@ -871,7 +871,7 @@ static void tcg_out_qemu_ld_indexed(TCGContext *s, TCGReg 
rd, TCGReg rj,
 case MO_SL:
 tcg_out_opc_ldx_w(s, rd, rj, rk);
 break;
-case MO_Q:
+case MO_UQ:
 tcg_out_opc_ldx_d(s, rd, rj, rk);
 break;
 default:
-- 
2.25.1




[PULL 16/34] tcg/loongarch64: Support raising sigbus for user-only

2022-02-10 Thread Richard Henderson
From: WANG Xuerui 

Signed-off-by: WANG Xuerui 
Reviewed-by: Richard Henderson 
Message-Id: <20220106134238.3936163-1-...@xen0n.name>
Signed-off-by: Richard Henderson 
---
 tcg/loongarch64/tcg-target.h |  2 -
 tcg/loongarch64/tcg-target.c.inc | 71 +++-
 2 files changed, 69 insertions(+), 4 deletions(-)

diff --git a/tcg/loongarch64/tcg-target.h b/tcg/loongarch64/tcg-target.h
index 05010805e7..d58a6162f2 100644
--- a/tcg/loongarch64/tcg-target.h
+++ b/tcg/loongarch64/tcg-target.h
@@ -171,9 +171,7 @@ void tb_target_set_jmp_target(uintptr_t, uintptr_t, 
uintptr_t, uintptr_t);
 
 #define TCG_TARGET_DEFAULT_MO (0)
 
-#ifdef CONFIG_SOFTMMU
 #define TCG_TARGET_NEED_LDST_LABELS
-#endif
 
 #define TCG_TARGET_HAS_MEMORY_BSWAP 0
 
diff --git a/tcg/loongarch64/tcg-target.c.inc b/tcg/loongarch64/tcg-target.c.inc
index d31a0e5991..a3debf6da7 100644
--- a/tcg/loongarch64/tcg-target.c.inc
+++ b/tcg/loongarch64/tcg-target.c.inc
@@ -29,6 +29,8 @@
  * THE SOFTWARE.
  */
 
+#include "../tcg-ldst.c.inc"
+
 #ifdef CONFIG_DEBUG_TCG
 static const char * const tcg_target_reg_names[TCG_TARGET_NB_REGS] = {
 "zero",
@@ -642,8 +644,6 @@ static bool tcg_out_sti(TCGContext *s, TCGType type, TCGArg 
val,
  */
 
 #if defined(CONFIG_SOFTMMU)
-#include "../tcg-ldst.c.inc"
-
 /*
  * helper signature: helper_ret_ld_mmu(CPUState *env, target_ulong addr,
  * MemOpIdx oi, uintptr_t ra)
@@ -825,6 +825,61 @@ static bool tcg_out_qemu_st_slow_path(TCGContext *s, 
TCGLabelQemuLdst *l)
 
 return tcg_out_goto(s, l->raddr);
 }
+#else
+
+/*
+ * Alignment helpers for user-mode emulation
+ */
+
+static void tcg_out_test_alignment(TCGContext *s, bool is_ld, TCGReg addr_reg,
+   unsigned a_bits)
+{
+TCGLabelQemuLdst *l = new_ldst_label(s);
+
+l->is_ld = is_ld;
+l->addrlo_reg = addr_reg;
+
+/*
+ * Without micro-architecture details, we don't know which of bstrpick or
+ * andi is faster, so use bstrpick as it's not constrained by imm field
+ * width. (Not to say alignments >= 2^12 are going to happen any time
+ * soon, though)
+ */
+tcg_out_opc_bstrpick_d(s, TCG_REG_TMP1, addr_reg, 0, a_bits - 1);
+
+l->label_ptr[0] = s->code_ptr;
+tcg_out_opc_bne(s, TCG_REG_TMP1, TCG_REG_ZERO, 0);
+
+l->raddr = tcg_splitwx_to_rx(s->code_ptr);
+}
+
+static bool tcg_out_fail_alignment(TCGContext *s, TCGLabelQemuLdst *l)
+{
+/* resolve label address */
+if (!reloc_br_sk16(l->label_ptr[0], tcg_splitwx_to_rx(s->code_ptr))) {
+return false;
+}
+
+tcg_out_mov(s, TCG_TYPE_TL, TCG_REG_A1, l->addrlo_reg);
+tcg_out_mov(s, TCG_TYPE_PTR, TCG_REG_A0, TCG_AREG0);
+
+/* tail call, with the return address back inline. */
+tcg_out_movi(s, TCG_TYPE_PTR, TCG_REG_RA, (uintptr_t)l->raddr);
+tcg_out_call_int(s, (const void *)(l->is_ld ? helper_unaligned_ld
+   : helper_unaligned_st), true);
+return true;
+}
+
+static bool tcg_out_qemu_ld_slow_path(TCGContext *s, TCGLabelQemuLdst *l)
+{
+return tcg_out_fail_alignment(s, l);
+}
+
+static bool tcg_out_qemu_st_slow_path(TCGContext *s, TCGLabelQemuLdst *l)
+{
+return tcg_out_fail_alignment(s, l);
+}
+
 #endif /* CONFIG_SOFTMMU */
 
 /*
@@ -887,6 +942,8 @@ static void tcg_out_qemu_ld(TCGContext *s, const TCGArg 
*args, TCGType type)
 MemOp opc;
 #if defined(CONFIG_SOFTMMU)
 tcg_insn_unit *label_ptr[1];
+#else
+unsigned a_bits;
 #endif
 TCGReg base;
 
@@ -903,6 +960,10 @@ static void tcg_out_qemu_ld(TCGContext *s, const TCGArg 
*args, TCGType type)
 data_regl, addr_regl,
 s->code_ptr, label_ptr);
 #else
+a_bits = get_alignment_bits(opc);
+if (a_bits) {
+tcg_out_test_alignment(s, true, addr_regl, a_bits);
+}
 base = tcg_out_zext_addr_if_32_bit(s, addr_regl, TCG_REG_TMP0);
 TCGReg guest_base_reg = USE_GUEST_BASE ? TCG_GUEST_BASE_REG : TCG_REG_ZERO;
 tcg_out_qemu_ld_indexed(s, data_regl, base, guest_base_reg, opc, type);
@@ -941,6 +1002,8 @@ static void tcg_out_qemu_st(TCGContext *s, const TCGArg 
*args)
 MemOp opc;
 #if defined(CONFIG_SOFTMMU)
 tcg_insn_unit *label_ptr[1];
+#else
+unsigned a_bits;
 #endif
 TCGReg base;
 
@@ -958,6 +1021,10 @@ static void tcg_out_qemu_st(TCGContext *s, const TCGArg 
*args)
 data_regl, addr_regl,
 s->code_ptr, label_ptr);
 #else
+a_bits = get_alignment_bits(opc);
+if (a_bits) {
+tcg_out_test_alignment(s, false, addr_regl, a_bits);
+}
 base = tcg_out_zext_addr_if_32_bit(s, addr_regl, TCG_REG_TMP0);
 TCGReg guest_base_reg = USE_GUEST_BASE ? TCG_GUEST_BASE_REG : TCG_REG_ZERO;
 tcg_out_qemu_st_indexed(s, data_regl, base, guest_base_reg, opc);
-- 
2.25.1




[PULL 05/34] linux-user/include/host/sparc64: Fix host_sigcontext

2022-02-10 Thread Richard Henderson
Sparc64 is unique on linux in *not* passing ucontext_t as
the third argument to a SA_SIGINFO handler.  It passes the
old struct sigcontext instead.

Set both pc and npc in host_signal_set_pc.
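
For context (a sketch; the struct and field names are illustrative
stand-ins for the kernel's sigcontext layout): SPARC has delayed control
transfer, so the CPU state carries both %pc and %npc, and redirecting
execution from a signal handler must keep the pair consistent.

    #include <stdint.h>

    struct regs_sketch {
        unsigned long tpc;    /* next instruction to execute */
        unsigned long tnpc;   /* its successor */
    };

    static void set_resume_pc(struct regs_sketch *r, uintptr_t pc)
    {
        r->tpc  = pc;
        r->tnpc = pc + 4;     /* sequential; no branch pending */
    }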

Fixes: 8b5bd461935b ("linux-user/host/sparc: Populate host_signal.h")
Reviewed-by: Peter Maydell 
Signed-off-by: Richard Henderson 
---
 linux-user/include/host/sparc64/host-signal.h | 17 +
 1 file changed, 9 insertions(+), 8 deletions(-)

diff --git a/linux-user/include/host/sparc64/host-signal.h 
b/linux-user/include/host/sparc64/host-signal.h
index f8a8a4908d..64957c2bca 100644
--- a/linux-user/include/host/sparc64/host-signal.h
+++ b/linux-user/include/host/sparc64/host-signal.h
@@ -11,22 +11,23 @@
 #ifndef SPARC64_HOST_SIGNAL_H
 #define SPARC64_HOST_SIGNAL_H
 
-/* FIXME: the third argument to a SA_SIGINFO handler is *not* ucontext_t. */
-typedef ucontext_t host_sigcontext;
+/* The third argument to a SA_SIGINFO handler is struct sigcontext.  */
+typedef struct sigcontext host_sigcontext;
 
-static inline uintptr_t host_signal_pc(host_sigcontext *uc)
+static inline uintptr_t host_signal_pc(host_sigcontext *sc)
 {
-return uc->uc_mcontext.mc_gregs[MC_PC];
+return sc->sigc_regs.tpc;
 }
 
-static inline void host_signal_set_pc(host_sigcontext *uc, uintptr_t pc)
+static inline void host_signal_set_pc(host_sigcontext *sc, uintptr_t pc)
 {
-uc->uc_mcontext.mc_gregs[MC_PC] = pc;
+sc->sigc_regs.tpc = pc;
+sc->sigc_regs.tnpc = pc + 4;
 }
 
-static inline void *host_signal_mask(host_sigcontext *uc)
+static inline void *host_signal_mask(host_sigcontext *sc)
 {
-return &uc->uc_sigmask;
+return &sc->sigc_mask;
 }
 
 static inline bool host_signal_write(siginfo_t *info, host_sigcontext *uc)
-- 
2.25.1




[PULL 03/34] linux-user: Introduce host_sigcontext

2022-02-10 Thread Richard Henderson
Do not directly access ucontext_t as the third signal parameter.
This is preparation for a sparc64 fix.
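
The indirection being introduced can be sketched as follows (handler and
variable names are illustrative, not from the patch): the third argument
reaches a SA_SIGINFO handler as void *, so common code can go through a
per-host alias instead of hard-coding ucontext_t.

    #include <signal.h>
    #include <ucontext.h>

    /* One per-host header picks the real type... */
    typedef ucontext_t host_sigcontext;   /* most hosts; sparc64 will differ */

    /* ...and common code only ever names the alias. */
    static void sigfault_handler(int sig, siginfo_t *info, void *puc)
    {
        host_sigcontext *sc = puc;
        (void)sig; (void)info; (void)sc;
    }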

Reviewed-by: Peter Maydell 
Reviewed-by: Philippe Mathieu-Daudé 
Signed-off-by: Richard Henderson 
---
 linux-user/include/host/aarch64/host-signal.h | 13 -
 linux-user/include/host/alpha/host-signal.h   | 11 +++
 linux-user/include/host/arm/host-signal.h | 11 +++
 linux-user/include/host/i386/host-signal.h| 11 +++
 linux-user/include/host/loongarch64/host-signal.h | 11 +++
 linux-user/include/host/mips/host-signal.h| 11 +++
 linux-user/include/host/ppc/host-signal.h | 11 +++
 linux-user/include/host/riscv/host-signal.h   | 11 +++
 linux-user/include/host/s390/host-signal.h| 11 +++
 linux-user/include/host/sparc/host-signal.h   | 11 +++
 linux-user/include/host/x86_64/host-signal.h  | 11 +++
 linux-user/signal.c   |  4 ++--
 12 files changed, 80 insertions(+), 47 deletions(-)

diff --git a/linux-user/include/host/aarch64/host-signal.h 
b/linux-user/include/host/aarch64/host-signal.h
index 76ab078069..be079684a2 100644
--- a/linux-user/include/host/aarch64/host-signal.h
+++ b/linux-user/include/host/aarch64/host-signal.h
@@ -11,6 +11,9 @@
 #ifndef AARCH64_HOST_SIGNAL_H
 #define AARCH64_HOST_SIGNAL_H
 
+/* The third argument to a SA_SIGINFO handler is ucontext_t. */
+typedef ucontext_t host_sigcontext;
+
 /* Pre-3.16 kernel headers don't have these, so provide fallback definitions */
 #ifndef ESR_MAGIC
 #define ESR_MAGIC 0x45535201
@@ -20,7 +23,7 @@ struct esr_context {
 };
 #endif
 
-static inline struct _aarch64_ctx *first_ctx(ucontext_t *uc)
+static inline struct _aarch64_ctx *first_ctx(host_sigcontext *uc)
 {
 return (struct _aarch64_ctx *)&uc->uc_mcontext.__reserved;
 }
@@ -30,22 +33,22 @@ static inline struct _aarch64_ctx *next_ctx(struct 
_aarch64_ctx *hdr)
 return (struct _aarch64_ctx *)((char *)hdr + hdr->size);
 }
 
-static inline uintptr_t host_signal_pc(ucontext_t *uc)
+static inline uintptr_t host_signal_pc(host_sigcontext *uc)
 {
 return uc->uc_mcontext.pc;
 }
 
-static inline void host_signal_set_pc(ucontext_t *uc, uintptr_t pc)
+static inline void host_signal_set_pc(host_sigcontext *uc, uintptr_t pc)
 {
 uc->uc_mcontext.pc = pc;
 }
 
-static inline void *host_signal_mask(ucontext_t *uc)
+static inline void *host_signal_mask(host_sigcontext *uc)
 {
 return &uc->uc_sigmask;
 }
 
-static inline bool host_signal_write(siginfo_t *info, ucontext_t *uc)
+static inline bool host_signal_write(siginfo_t *info, host_sigcontext *uc)
 {
 struct _aarch64_ctx *hdr;
 uint32_t insn;
diff --git a/linux-user/include/host/alpha/host-signal.h 
b/linux-user/include/host/alpha/host-signal.h
index a44d670f2b..4f9e2abc4b 100644
--- a/linux-user/include/host/alpha/host-signal.h
+++ b/linux-user/include/host/alpha/host-signal.h
@@ -11,22 +11,25 @@
 #ifndef ALPHA_HOST_SIGNAL_H
 #define ALPHA_HOST_SIGNAL_H
 
-static inline uintptr_t host_signal_pc(ucontext_t *uc)
+/* The third argument to a SA_SIGINFO handler is ucontext_t. */
+typedef ucontext_t host_sigcontext;
+
+static inline uintptr_t host_signal_pc(host_sigcontext *uc)
 {
 return uc->uc_mcontext.sc_pc;
 }
 
-static inline void host_signal_set_pc(ucontext_t *uc, uintptr_t pc)
+static inline void host_signal_set_pc(host_sigcontext *uc, uintptr_t pc)
 {
 uc->uc_mcontext.sc_pc = pc;
 }
 
-static inline void *host_signal_mask(ucontext_t *uc)
+static inline void *host_signal_mask(host_sigcontext *uc)
 {
 return &uc->uc_sigmask;
 }
 
-static inline bool host_signal_write(siginfo_t *info, ucontext_t *uc)
+static inline bool host_signal_write(siginfo_t *info, host_sigcontext *uc)
 {
 uint32_t *pc = (uint32_t *)host_signal_pc(uc);
 uint32_t insn = *pc;
diff --git a/linux-user/include/host/arm/host-signal.h 
b/linux-user/include/host/arm/host-signal.h
index bbeb4ffefb..faba496d24 100644
--- a/linux-user/include/host/arm/host-signal.h
+++ b/linux-user/include/host/arm/host-signal.h
@@ -11,22 +11,25 @@
 #ifndef ARM_HOST_SIGNAL_H
 #define ARM_HOST_SIGNAL_H
 
-static inline uintptr_t host_signal_pc(ucontext_t *uc)
+/* The third argument to a SA_SIGINFO handler is ucontext_t. */
+typedef ucontext_t host_sigcontext;
+
+static inline uintptr_t host_signal_pc(host_sigcontext *uc)
 {
 return uc->uc_mcontext.arm_pc;
 }
 
-static inline void host_signal_set_pc(ucontext_t *uc, uintptr_t pc)
+static inline void host_signal_set_pc(host_sigcontext *uc, uintptr_t pc)
 {
 uc->uc_mcontext.arm_pc = pc;
 }
 
-static inline void *host_signal_mask(ucontext_t *uc)
+static inline void *host_signal_mask(host_sigcontext *uc)
 {
 return &uc->uc_sigmask;
 }
 
-static inline bool host_signal_write(siginfo_t *info, ucontext_t *uc)
+static inline bool host_signal_write(siginfo_t *info, host_sigcontext *uc)
 {
 /*
  * In the FSR, bit 11 is WnR, assuming a v6 or

[PULL 12/34] tcg/ppc: Support raising sigbus for user-only

2022-02-10 Thread Richard Henderson
Reviewed-by: Peter Maydell 
Signed-off-by: Richard Henderson 
---
 tcg/ppc/tcg-target.h |  2 -
 tcg/ppc/tcg-target.c.inc | 98 
 2 files changed, 90 insertions(+), 10 deletions(-)

diff --git a/tcg/ppc/tcg-target.h b/tcg/ppc/tcg-target.h
index 0943192cde..c775c97b61 100644
--- a/tcg/ppc/tcg-target.h
+++ b/tcg/ppc/tcg-target.h
@@ -182,9 +182,7 @@ void tb_target_set_jmp_target(uintptr_t, uintptr_t, 
uintptr_t, uintptr_t);
 #define TCG_TARGET_DEFAULT_MO (0)
 #define TCG_TARGET_HAS_MEMORY_BSWAP 1
 
-#ifdef CONFIG_SOFTMMU
 #define TCG_TARGET_NEED_LDST_LABELS
-#endif
 #define TCG_TARGET_NEED_POOL_LABELS
 
 #endif
diff --git a/tcg/ppc/tcg-target.c.inc b/tcg/ppc/tcg-target.c.inc
index 9e79a7edee..dea24f23c4 100644
--- a/tcg/ppc/tcg-target.c.inc
+++ b/tcg/ppc/tcg-target.c.inc
@@ -24,6 +24,7 @@
 
 #include "elf.h"
 #include "../tcg-pool.c.inc"
+#include "../tcg-ldst.c.inc"
 
 /*
  * Standardize on the _CALL_FOO symbols used by GCC:
@@ -1881,7 +1882,8 @@ void tb_target_set_jmp_target(uintptr_t tc_ptr, uintptr_t 
jmp_rx,
 }
 }
 
-static void tcg_out_call(TCGContext *s, const tcg_insn_unit *target)
+static void tcg_out_call_int(TCGContext *s, int lk,
+ const tcg_insn_unit *target)
 {
 #ifdef _CALL_AIX
 /* Look through the descriptor.  If the branch is in range, and we
@@ -1892,7 +1894,7 @@ static void tcg_out_call(TCGContext *s, const 
tcg_insn_unit *target)
 
 if (in_range_b(diff) && toc == (uint32_t)toc) {
 tcg_out_movi(s, TCG_TYPE_PTR, TCG_REG_TMP1, toc);
-tcg_out_b(s, LK, tgt);
+tcg_out_b(s, lk, tgt);
 } else {
 /* Fold the low bits of the constant into the addresses below.  */
 intptr_t arg = (intptr_t)target;
@@ -1907,7 +1909,7 @@ static void tcg_out_call(TCGContext *s, const 
tcg_insn_unit *target)
 tcg_out_ld(s, TCG_TYPE_PTR, TCG_REG_R0, TCG_REG_TMP1, ofs);
 tcg_out32(s, MTSPR | RA(TCG_REG_R0) | CTR);
 tcg_out_ld(s, TCG_TYPE_PTR, TCG_REG_R2, TCG_REG_TMP1, ofs + SZP);
-tcg_out32(s, BCCTR | BO_ALWAYS | LK);
+tcg_out32(s, BCCTR | BO_ALWAYS | lk);
 }
 #elif defined(_CALL_ELF) && _CALL_ELF == 2
 intptr_t diff;
@@ -1921,16 +1923,21 @@ static void tcg_out_call(TCGContext *s, const 
tcg_insn_unit *target)
 
 diff = tcg_pcrel_diff(s, target);
 if (in_range_b(diff)) {
-tcg_out_b(s, LK, target);
+tcg_out_b(s, lk, target);
 } else {
 tcg_out32(s, MTSPR | RS(TCG_REG_R12) | CTR);
-tcg_out32(s, BCCTR | BO_ALWAYS | LK);
+tcg_out32(s, BCCTR | BO_ALWAYS | lk);
 }
 #else
-tcg_out_b(s, LK, target);
+tcg_out_b(s, lk, target);
 #endif
 }
 
+static void tcg_out_call(TCGContext *s, const tcg_insn_unit *target)
+{
+tcg_out_call_int(s, LK, target);
+}
+
 static const uint32_t qemu_ldx_opc[(MO_SSIZE + MO_BSWAP) + 1] = {
 [MO_UB] = LBZX,
 [MO_UW] = LHZX,
@@ -1960,8 +1967,6 @@ static const uint32_t qemu_exts_opc[4] = {
 };
 
 #if defined (CONFIG_SOFTMMU)
-#include "../tcg-ldst.c.inc"
-
 /* helper signature: helper_ld_mmu(CPUState *env, target_ulong addr,
  * int mmu_idx, uintptr_t ra)
  */
@@ -2227,6 +2232,71 @@ static bool tcg_out_qemu_st_slow_path(TCGContext *s, 
TCGLabelQemuLdst *lb)
 tcg_out_b(s, 0, lb->raddr);
 return true;
 }
+#else
+
+static void tcg_out_test_alignment(TCGContext *s, bool is_ld, TCGReg addrlo,
+   TCGReg addrhi, unsigned a_bits)
+{
+unsigned a_mask = (1 << a_bits) - 1;
+TCGLabelQemuLdst *label = new_ldst_label(s);
+
+label->is_ld = is_ld;
+label->addrlo_reg = addrlo;
+label->addrhi_reg = addrhi;
+
+/* We are expecting a_bits to max out at 7, much lower than ANDI. */
+tcg_debug_assert(a_bits < 16);
+tcg_out32(s, ANDI | SAI(addrlo, TCG_REG_R0, a_mask));
+
+label->label_ptr[0] = s->code_ptr;
+tcg_out32(s, BC | BI(0, CR_EQ) | BO_COND_FALSE | LK);
+
+label->raddr = tcg_splitwx_to_rx(s->code_ptr);
+}
+
+static bool tcg_out_fail_alignment(TCGContext *s, TCGLabelQemuLdst *l)
+{
+if (!reloc_pc14(l->label_ptr[0], tcg_splitwx_to_rx(s->code_ptr))) {
+return false;
+}
+
+if (TCG_TARGET_REG_BITS < TARGET_LONG_BITS) {
+TCGReg arg = TCG_REG_R4;
+#ifdef TCG_TARGET_CALL_ALIGN_ARGS
+arg |= 1;
+#endif
+if (l->addrlo_reg != arg) {
+tcg_out_mov(s, TCG_TYPE_I32, arg, l->addrhi_reg);
+tcg_out_mov(s, TCG_TYPE_I32, arg + 1, l->addrlo_reg);
+} else if (l->addrhi_reg != arg + 1) {
+tcg_out_mov(s, TCG_TYPE_I32, arg + 1, l->addrlo_reg);
+tcg_out_mov(s, TCG_TYPE_I32, arg, l->addrhi_reg);
+} else {
+tcg_out_mov(s, TCG_TYPE_I32, TCG_REG_R0, arg);
+tcg_out_mov(s, TCG_TYPE_I32, arg, arg + 1);
+tcg_out_mov(s, TCG_TYPE_I32, arg + 1, TCG_REG_R0);
+}
+} else {
+tcg_out_mov(s, TCG_TYPE_TL, TCG_REG_R4, l->addrlo_reg);
+}
+t

[PULL 06/34] accel/tcg: Optimize jump cache flush during tlb range flush

2022-02-10 Thread Richard Henderson
From: Idan Horowitz 

When the length of the range is large enough, clearing the whole cache is
faster than iterating over the (possibly extremely large) set of pages
contained in the range.

This mimics the pre-existing similar optimization done on the flush of the
tlb itself.
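
To put a number on the crossover (assuming 4 KiB target pages and the
4096-entry jump cache QEMU used at the time): the early-out fires for
ranges of 16 MiB or more, i.e. 4096 * 4 KiB, replacing at least 4096
per-page probes with one whole-cache clear.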

Signed-off-by: Idan Horowitz 
Message-Id: <20220110164754.1066025-1-idan.horow...@gmail.com>
Reviewed-by: Richard Henderson 
Signed-off-by: Richard Henderson 
---
 accel/tcg/cputlb.c | 9 +
 1 file changed, 9 insertions(+)

diff --git a/accel/tcg/cputlb.c b/accel/tcg/cputlb.c
index 5e0d0eebc3..926d9a9192 100644
--- a/accel/tcg/cputlb.c
+++ b/accel/tcg/cputlb.c
@@ -783,6 +783,15 @@ static void tlb_flush_range_by_mmuidx_async_0(CPUState 
*cpu,
 }
 qemu_spin_unlock(&env_tlb(env)->c.lock);
 
+/*
+ * If the length is larger than the jump cache size, then it will take
+ * longer to clear each entry individually than it will to clear it all.
+ */
+if (d.len >= (TARGET_PAGE_SIZE * TB_JMP_CACHE_SIZE)) {
+cpu_tb_jmp_cache_clear(cpu);
+return;
+}
+
 for (target_ulong i = 0; i < d.len; i += TARGET_PAGE_SIZE) {
 tb_flush_jmp_cache(cpu, d.addr + i);
 }
-- 
2.25.1




[PULL 02/34] linux-user: Introduce host_signal_mask

2022-02-10 Thread Richard Henderson
Do not directly access the uc_sigmask member.
This is preparation for a sparc64 fix.

Reviewed-by: Peter Maydell 
Reviewed-by: Philippe Mathieu-Daudé 
Signed-off-by: Richard Henderson 
---
 linux-user/include/host/aarch64/host-signal.h  |  5 +
 linux-user/include/host/alpha/host-signal.h|  5 +
 linux-user/include/host/arm/host-signal.h  |  5 +
 linux-user/include/host/i386/host-signal.h |  5 +
 .../include/host/loongarch64/host-signal.h |  5 +
 linux-user/include/host/mips/host-signal.h |  5 +
 linux-user/include/host/ppc/host-signal.h  |  5 +
 linux-user/include/host/riscv/host-signal.h|  5 +
 linux-user/include/host/s390/host-signal.h |  5 +
 linux-user/include/host/sparc/host-signal.h|  5 +
 linux-user/include/host/x86_64/host-signal.h   |  5 +
 linux-user/signal.c| 18 --
 12 files changed, 63 insertions(+), 10 deletions(-)

diff --git a/linux-user/include/host/aarch64/host-signal.h 
b/linux-user/include/host/aarch64/host-signal.h
index 9770b36dc1..76ab078069 100644
--- a/linux-user/include/host/aarch64/host-signal.h
+++ b/linux-user/include/host/aarch64/host-signal.h
@@ -40,6 +40,11 @@ static inline void host_signal_set_pc(ucontext_t *uc, 
uintptr_t pc)
 uc->uc_mcontext.pc = pc;
 }
 
+static inline void *host_signal_mask(ucontext_t *uc)
+{
+return &uc->uc_sigmask;
+}
+
 static inline bool host_signal_write(siginfo_t *info, ucontext_t *uc)
 {
 struct _aarch64_ctx *hdr;
diff --git a/linux-user/include/host/alpha/host-signal.h 
b/linux-user/include/host/alpha/host-signal.h
index f4c942948a..a44d670f2b 100644
--- a/linux-user/include/host/alpha/host-signal.h
+++ b/linux-user/include/host/alpha/host-signal.h
@@ -21,6 +21,11 @@ static inline void host_signal_set_pc(ucontext_t *uc, 
uintptr_t pc)
 uc->uc_mcontext.sc_pc = pc;
 }
 
+static inline void *host_signal_mask(ucontext_t *uc)
+{
+return &uc->uc_sigmask;
+}
+
 static inline bool host_signal_write(siginfo_t *info, ucontext_t *uc)
 {
 uint32_t *pc = (uint32_t *)host_signal_pc(uc);
diff --git a/linux-user/include/host/arm/host-signal.h 
b/linux-user/include/host/arm/host-signal.h
index 6c095773c0..bbeb4ffefb 100644
--- a/linux-user/include/host/arm/host-signal.h
+++ b/linux-user/include/host/arm/host-signal.h
@@ -21,6 +21,11 @@ static inline void host_signal_set_pc(ucontext_t *uc, 
uintptr_t pc)
 uc->uc_mcontext.arm_pc = pc;
 }
 
+static inline void *host_signal_mask(ucontext_t *uc)
+{
+return &uc->uc_sigmask;
+}
+
 static inline bool host_signal_write(siginfo_t *info, ucontext_t *uc)
 {
 /*
diff --git a/linux-user/include/host/i386/host-signal.h 
b/linux-user/include/host/i386/host-signal.h
index abe1ece5c9..fd36f06bda 100644
--- a/linux-user/include/host/i386/host-signal.h
+++ b/linux-user/include/host/i386/host-signal.h
@@ -21,6 +21,11 @@ static inline void host_signal_set_pc(ucontext_t *uc, 
uintptr_t pc)
 uc->uc_mcontext.gregs[REG_EIP] = pc;
 }
 
+static inline void *host_signal_mask(ucontext_t *uc)
+{
+return &uc->uc_sigmask;
+}
+
 static inline bool host_signal_write(siginfo_t *info, ucontext_t *uc)
 {
 return uc->uc_mcontext.gregs[REG_TRAPNO] == 0xe
diff --git a/linux-user/include/host/loongarch64/host-signal.h 
b/linux-user/include/host/loongarch64/host-signal.h
index 7effa24251..a9dfe0c688 100644
--- a/linux-user/include/host/loongarch64/host-signal.h
+++ b/linux-user/include/host/loongarch64/host-signal.h
@@ -21,6 +21,11 @@ static inline void host_signal_set_pc(ucontext_t *uc, 
uintptr_t pc)
 uc->uc_mcontext.__pc = pc;
 }
 
+static inline void *host_signal_mask(ucontext_t *uc)
+{
+return &uc->uc_sigmask;
+}
+
 static inline bool host_signal_write(siginfo_t *info, ucontext_t *uc)
 {
 const uint32_t *pinsn = (const uint32_t *)host_signal_pc(uc);
diff --git a/linux-user/include/host/mips/host-signal.h 
b/linux-user/include/host/mips/host-signal.h
index c666ed8c3f..ff840dd491 100644
--- a/linux-user/include/host/mips/host-signal.h
+++ b/linux-user/include/host/mips/host-signal.h
@@ -21,6 +21,11 @@ static inline void host_signal_set_pc(ucontext_t *uc, 
uintptr_t pc)
 uc->uc_mcontext.pc = pc;
 }
 
+static inline void *host_signal_mask(ucontext_t *uc)
+{
+return &uc->uc_sigmask;
+}
+
 #if defined(__misp16) || defined(__mips_micromips)
 #error "Unsupported encoding"
 #endif
diff --git a/linux-user/include/host/ppc/host-signal.h 
b/linux-user/include/host/ppc/host-signal.h
index 1d8e658ff7..730a321d98 100644
--- a/linux-user/include/host/ppc/host-signal.h
+++ b/linux-user/include/host/ppc/host-signal.h
@@ -21,6 +21,11 @@ static inline void host_signal_set_pc(ucontext_t *uc, 
uintptr_t pc)
 uc->uc_mcontext.regs->nip = pc;
 }
 
+static inline void *host_signal_mask(ucontext_t *uc)
+{
+return &uc->uc_sigmask;
+}
+
 static inline bool host_signal_write(siginfo_t *info, ucontext_t *uc)
 {
 return uc->uc_mcontext.regs->trap != 0x400
diff --git a/linux-user/inc

[PULL 04/34] linux-user: Move sparc/host-signal.h to sparc64/host-signal.h

2022-02-10 Thread Richard Henderson
We do not support sparc32 as a host, so there's no point in
sparc64 redirecting to sparc.

Reviewed-by: Philippe Mathieu-Daudé 
Signed-off-by: Richard Henderson 
---
 linux-user/include/host/sparc/host-signal.h   | 71 ---
 linux-user/include/host/sparc64/host-signal.h | 64 -
 2 files changed, 63 insertions(+), 72 deletions(-)
 delete mode 100644 linux-user/include/host/sparc/host-signal.h

diff --git a/linux-user/include/host/sparc/host-signal.h 
b/linux-user/include/host/sparc/host-signal.h
deleted file mode 100644
index 871b6bb269..00
--- a/linux-user/include/host/sparc/host-signal.h
+++ /dev/null
@@ -1,71 +0,0 @@
-/*
- * host-signal.h: signal info dependent on the host architecture
- *
- * Copyright (c) 2003-2005 Fabrice Bellard
- * Copyright (c) 2021 Linaro Limited
- *
- * This work is licensed under the terms of the GNU LGPL, version 2.1 or later.
- * See the COPYING file in the top-level directory.
- */
-
-#ifndef SPARC_HOST_SIGNAL_H
-#define SPARC_HOST_SIGNAL_H
-
-/* FIXME: the third argument to a SA_SIGINFO handler is *not* ucontext_t. */
-typedef ucontext_t host_sigcontext;
-
-static inline uintptr_t host_signal_pc(host_sigcontext *uc)
-{
-#ifdef __arch64__
-return uc->uc_mcontext.mc_gregs[MC_PC];
-#else
-return uc->uc_mcontext.gregs[REG_PC];
-#endif
-}
-
-static inline void host_signal_set_pc(host_sigcontext *uc, uintptr_t pc)
-{
-#ifdef __arch64__
-uc->uc_mcontext.mc_gregs[MC_PC] = pc;
-#else
-uc->uc_mcontext.gregs[REG_PC] = pc;
-#endif
-}
-
-static inline void *host_signal_mask(host_sigcontext *uc)
-{
-return &uc->uc_sigmask;
-}
-
-static inline bool host_signal_write(siginfo_t *info, host_sigcontext *uc)
-{
-uint32_t insn = *(uint32_t *)host_signal_pc(uc);
-
-if ((insn >> 30) == 3) {
-switch ((insn >> 19) & 0x3f) {
-case 0x05: /* stb */
-case 0x15: /* stba */
-case 0x06: /* sth */
-case 0x16: /* stha */
-case 0x04: /* st */
-case 0x14: /* sta */
-case 0x07: /* std */
-case 0x17: /* stda */
-case 0x0e: /* stx */
-case 0x1e: /* stxa */
-case 0x24: /* stf */
-case 0x34: /* stfa */
-case 0x27: /* stdf */
-case 0x37: /* stdfa */
-case 0x26: /* stqf */
-case 0x36: /* stqfa */
-case 0x25: /* stfsr */
-case 0x3c: /* casa */
-case 0x3e: /* casxa */
-return true;
-}
-}
-return false;
-}
-
-#endif
diff --git a/linux-user/include/host/sparc64/host-signal.h 
b/linux-user/include/host/sparc64/host-signal.h
index 1191fe2d40..f8a8a4908d 100644
--- a/linux-user/include/host/sparc64/host-signal.h
+++ b/linux-user/include/host/sparc64/host-signal.h
@@ -1 +1,63 @@
-#include "../sparc/host-signal.h"
+/*
+ * host-signal.h: signal info dependent on the host architecture
+ *
+ * Copyright (c) 2003-2005 Fabrice Bellard
+ * Copyright (c) 2021 Linaro Limited
+ *
+ * This work is licensed under the terms of the GNU LGPL, version 2.1 or later.
+ * See the COPYING file in the top-level directory.
+ */
+
+#ifndef SPARC64_HOST_SIGNAL_H
+#define SPARC64_HOST_SIGNAL_H
+
+/* FIXME: the third argument to a SA_SIGINFO handler is *not* ucontext_t. */
+typedef ucontext_t host_sigcontext;
+
+static inline uintptr_t host_signal_pc(host_sigcontext *uc)
+{
+return uc->uc_mcontext.mc_gregs[MC_PC];
+}
+
+static inline void host_signal_set_pc(host_sigcontext *uc, uintptr_t pc)
+{
+uc->uc_mcontext.mc_gregs[MC_PC] = pc;
+}
+
+static inline void *host_signal_mask(host_sigcontext *uc)
+{
+return &uc->uc_sigmask;
+}
+
+static inline bool host_signal_write(siginfo_t *info, host_sigcontext *uc)
+{
+uint32_t insn = *(uint32_t *)host_signal_pc(uc);
+
+if ((insn >> 30) == 3) {
+switch ((insn >> 19) & 0x3f) {
+case 0x05: /* stb */
+case 0x15: /* stba */
+case 0x06: /* sth */
+case 0x16: /* stha */
+case 0x04: /* st */
+case 0x14: /* sta */
+case 0x07: /* std */
+case 0x17: /* stda */
+case 0x0e: /* stx */
+case 0x1e: /* stxa */
+case 0x24: /* stf */
+case 0x34: /* stfa */
+case 0x27: /* stdf */
+case 0x37: /* stdfa */
+case 0x26: /* stqf */
+case 0x36: /* stqfa */
+case 0x25: /* stfsr */
+case 0x3c: /* casa */
+case 0x3e: /* casxa */
+return true;
+}
+}
+return false;
+}
+
+#endif
-- 
2.25.1




[PULL 01/34] common-user/host/sparc64: Fix safe_syscall_base

2022-02-10 Thread Richard Henderson
Use the "retl" instead of "ret" instruction alias, since we
do not allocate a register window in this function.

Fix the offset to the first stacked parameter, which lies
beyond the register window save area.
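
The corrected arithmetic, worked out (assuming the SPARC V9 ABI: a
2047-byte stack bias and a 16-slot, 8-bytes-per-slot register window save
area; names below are illustrative):

    #include <stdint.h>

    enum {
        STACK_BIAS_V9 = 2047,     /* %sp is biased by this much */
        WINDOW_BYTES  = 16 * 8,   /* save slots for %l0-%l7, %i0-%i7 */
    };

    /* Byte offset from %sp of the Nth stacked parameter. */
    static uintptr_t param_offset(unsigned n)
    {
        return STACK_BIAS_V9 + WINDOW_BYTES + n * 8;   /* n = 0 -> 2175 */
    }

The retl/ret half of the fix follows the same reasoning: ret returns
through %i7 and presumes a register window was allocated with save, while
retl returns through %o7 and is the correct form for a leaf routine like
this one.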

Fixes: 95c021dac835 ("linux-user/host/sparc64: Add safe-syscall.inc.S")
Signed-off-by: Richard Henderson 
---
 common-user/host/sparc64/safe-syscall.inc.S | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/common-user/host/sparc64/safe-syscall.inc.S 
b/common-user/host/sparc64/safe-syscall.inc.S
index a2f2b9c967..c7be8f2d25 100644
--- a/common-user/host/sparc64/safe-syscall.inc.S
+++ b/common-user/host/sparc64/safe-syscall.inc.S
@@ -24,7 +24,8 @@
 .type   safe_syscall_end, @function
 
 #define STACK_BIAS  2047
-#define PARAM(N)STACK_BIAS + N*8
+#define WINDOW_SIZE 16 * 8
+#define PARAM(N)STACK_BIAS + WINDOW_SIZE + N * 8
 
 /*
  * This is the entry point for making a system call. The calling
@@ -74,7 +75,7 @@ safe_syscall_end:
 /* code path for having successfully executed the syscall */
 bcs,pn  %xcc, 1f
  nop
-ret
+retl
  nop
 
 /* code path when we didn't execute the syscall */
-- 
2.25.1




[PULL 00/34] tcg patch queue

2022-02-10 Thread Richard Henderson
The following changes since commit 0a301624c2f4ced3331ffd5bce85b4274fe132af:

  Merge remote-tracking branch 'remotes/pmaydell/tags/pull-target-arm-20220208' 
into staging (2022-02-08 11:40:08 +)

are available in the Git repository at:

  https://gitlab.com/rth7680/qemu.git tags/pull-tcg-20220211

for you to fetch changes up to 5c1a101ef6b85537a4ade93c39ea81cadd5c246e:

  tests/tcg/multiarch: Add sigbus.c (2022-02-09 09:00:01 +1100)


Fix safe_syscall_base for sparc64.
Fix host signal handling for sparc64-linux.
Speedups for jump cache and work list probing.
Fix for exception replays.
Raise guest SIGBUS for user-only misaligned accesses.


Idan Horowitz (2):
  accel/tcg: Optimize jump cache flush during tlb range flush
  softmmu/cpus: Check if the cpu work list is empty atomically

Pavel Dovgalyuk (1):
  replay: use CF_NOIRQ for special exception-replaying TB

Richard Henderson (29):
  common-user/host/sparc64: Fix safe_syscall_base
  linux-user: Introduce host_signal_mask
  linux-user: Introduce host_sigcontext
  linux-user: Move sparc/host-signal.h to sparc64/host-signal.h
  linux-user/include/host/sparc64: Fix host_sigcontext
  tcg/i386: Support raising sigbus for user-only
  tcg/aarch64: Support raising sigbus for user-only
  tcg/ppc: Support raising sigbus for user-only
  tcg/riscv: Support raising sigbus for user-only
  tcg/s390x: Support raising sigbus for user-only
  tcg/tci: Support raising sigbus for user-only
  tcg/arm: Drop support for armv4 and armv5 hosts
  tcg/arm: Remove use_armv5t_instructions
  tcg/arm: Remove use_armv6_instructions
  tcg/arm: Check alignment for ldrd and strd
  tcg/arm: Support unaligned access for softmmu
  tcg/arm: Reserve a register for guest_base
  tcg/arm: Support raising sigbus for user-only
  tcg/mips: Support unaligned access for user-only
  tcg/mips: Support unaligned access for softmmu
  tcg/sparc: Use tcg_out_movi_imm13 in tcg_out_addsub2_i64
  tcg/sparc: Split out tcg_out_movi_imm32
  tcg/sparc: Add scratch argument to tcg_out_movi_int
  tcg/sparc: Improve code gen for shifted 32-bit constants
  tcg/sparc: Convert patch_reloc to return bool
  tcg/sparc: Use the constant pool for 64-bit constants
  tcg/sparc: Add tcg_out_jmpl_const for better tail calls
  tcg/sparc: Support unaligned access for user-only
  tests/tcg/multiarch: Add sigbus.c

WANG Xuerui (2):
  tcg/loongarch64: Fix fallout from recent MO_Q renaming
  tcg/loongarch64: Support raising sigbus for user-only

 linux-user/include/host/aarch64/host-signal.h |  16 +-
 linux-user/include/host/alpha/host-signal.h   |  14 +-
 linux-user/include/host/arm/host-signal.h |  14 +-
 linux-user/include/host/i386/host-signal.h|  14 +-
 linux-user/include/host/loongarch64/host-signal.h |  14 +-
 linux-user/include/host/mips/host-signal.h|  14 +-
 linux-user/include/host/ppc/host-signal.h |  14 +-
 linux-user/include/host/riscv/host-signal.h   |  14 +-
 linux-user/include/host/s390/host-signal.h|  14 +-
 linux-user/include/host/sparc/host-signal.h   |  63 
 linux-user/include/host/sparc64/host-signal.h |  65 +++-
 linux-user/include/host/x86_64/host-signal.h  |  14 +-
 tcg/aarch64/tcg-target.h  |   2 -
 tcg/arm/tcg-target.h  |   6 +-
 tcg/i386/tcg-target.h |   2 -
 tcg/loongarch64/tcg-target.h  |   2 -
 tcg/mips/tcg-target.h |   2 -
 tcg/ppc/tcg-target.h  |   2 -
 tcg/riscv/tcg-target.h|   2 -
 tcg/s390x/tcg-target.h|   2 -
 accel/tcg/cpu-exec.c  |   3 +-
 accel/tcg/cputlb.c|   9 +
 linux-user/signal.c   |  22 +-
 softmmu/cpus.c|   7 +-
 tcg/tci.c |  20 +-
 tests/tcg/multiarch/sigbus.c  |  68 
 tcg/aarch64/tcg-target.c.inc  |  91 -
 tcg/arm/tcg-target.c.inc  | 410 +-
 tcg/i386/tcg-target.c.inc | 103 +-
 tcg/loongarch64/tcg-target.c.inc  |  73 +++-
 tcg/mips/tcg-target.c.inc | 387 ++--
 tcg/ppc/tcg-target.c.inc  |  98 +-
 tcg/riscv/tcg-target.c.inc|  63 +++-
 tcg/s390x/tcg-target.c.inc|  59 +++-
 tcg/sparc/tcg-target.c.inc| 348 +++---
 common-user/host/sparc64/safe-syscall.inc.S   |   5 +-
 36 files changed, 1561 insertions(+), 495 deletions(-)

[PULL 08/34] replay: use CF_NOIRQ for special exception-replaying TB

2022-02-10 Thread Richard Henderson
From: Pavel Dovgalyuk 

Commit aff0e204cb1f1c036a496c94c15f5dfafcd9b4b4 introduced CF_NOIRQ usage,
but one case was forgotten. Record/replay uses one special TB which is not
really executed, but used to cause a correct exception in replay mode.
This patch adds CF_NOIRQ flag for such block.

Signed-off-by: Pavel Dovgalyuk 
Reviewed-by: Richard Henderson 
Message-Id: <164362834054.1754532.7678416881159817273.stgit@pasha-ThinkPad-X280>
Signed-off-by: Richard Henderson 
---
 accel/tcg/cpu-exec.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/accel/tcg/cpu-exec.c b/accel/tcg/cpu-exec.c
index 8b4cd6c59d..8da6a55593 100644
--- a/accel/tcg/cpu-exec.c
+++ b/accel/tcg/cpu-exec.c
@@ -648,7 +648,8 @@ static inline bool cpu_handle_exception(CPUState *cpu, int 
*ret)
 if (replay_has_exception()
 && cpu_neg(cpu)->icount_decr.u16.low + cpu->icount_extra == 0) {
 /* Execute just one insn to trigger exception pending in the log */
-cpu->cflags_next_tb = (curr_cflags(cpu) & ~CF_USE_ICOUNT) | 1;
+cpu->cflags_next_tb = (curr_cflags(cpu) & ~CF_USE_ICOUNT)
+| CF_NOIRQ | 1;
 }
 #endif
 return false;
-- 
2.25.1




Re: [PATCH 11/15] target: Use ArchCPU as interface to target CPU

2022-02-10 Thread Richard Henderson

On 2/11/22 04:35, Taylor Simpson wrote:

-#define HEXAGON_CPU_CLASS(klass) \
-OBJECT_CLASS_CHECK(HexagonCPUClass, (klass), TYPE_HEXAGON_CPU)
-#define HEXAGON_CPU(obj) \
-OBJECT_CHECK(HexagonCPU, (obj), TYPE_HEXAGON_CPU)
-#define HEXAGON_CPU_GET_CLASS(obj) \
-OBJECT_GET_CLASS(HexagonCPUClass, (obj), TYPE_HEXAGON_CPU)
+OBJECT_DECLARE_TYPE(HexagonCPU, HexagonCPUClass, HEXAGON_CPU)
  
  typedef struct HexagonCPUClass {

  /*< private >*/

If that's correct, the typedef struct HexagonCPUClass should NOT change to 
typedef struct ArchCPU, and the typedef of ArchCPU below would stay.



This is the change you'd make with the current state of the world, yes.



So, if I submit the above as a standalone patch, then Philippe wouldn't need to 
modify target/hexagon/cpu.h.  Correct?


But no, Phil would need a change, because he introduces

typedef struct ArchCPU ArchCPU;

as a generic typedef very early.  You cannot then redefine

typedef struct HexagonCPU ArchCPU;

which means that we still have to rearrange the direction of the typedef to

typedef ArchCPU HexagonCPU;

etc.  But it's definitely a smaller change (and matches all of the other 
targets).

I do think that the conversion to OBJECT_DECLARE_TYPE should happen first, via whichever 
tree you choose.



r~



Re: [PATCH v2 12/12] Hexagon (target/hexagon) assignment to c4 should wait until packet commit

2022-02-10 Thread Richard Henderson

On 2/10/22 13:15, Taylor Simpson wrote:

On Hexagon, c4 is an alias for predicate registers P3:0.  If we assign to
c4 inside a packet with reads from predicate registers, the predicate
reads should get the old values.

Test case added to tests/tcg/hexagon/preg_alias.c

Co-authored-by: Michael Lambert
Signed-off-by: Taylor Simpson
---
  target/hexagon/genptr.c| 14 -
  tests/tcg/hexagon/preg_alias.c | 38 ++
  2 files changed, 47 insertions(+), 5 deletions(-)


Reviewed-by: Richard Henderson 

r~



Re: [PATCH v2 10/12] Hexagon (target/hexagon) fix bug in conv_df2uw_chop

2022-02-10 Thread Richard Henderson

On 2/10/22 13:15, Taylor Simpson wrote:

Fix a typo that checked for a 32-bit NaN instead of a 64-bit one

Test case added in tests/tcg/hexagon/usr.c

Signed-off-by: Taylor Simpson
---
  target/hexagon/op_helper.c | 2 +-
  tests/tcg/hexagon/usr.c| 4 
  2 files changed, 5 insertions(+), 1 deletion(-)
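
A minimal sketch of the 64-bit check the fix needs, written against the raw IEEE-754 double layout (illustrative only, not the actual QEMU helper):

    #include <stdbool.h>
    #include <stdint.h>

    /* A float64 is a NaN when the 11-bit exponent is all ones and the
     * 52-bit fraction is non-zero. */
    static bool f64_is_any_nan(uint64_t bits)
    {
        return ((bits >> 52) & 0x7ff) == 0x7ff &&
               (bits & 0x000fffffffffffffULL) != 0;
    }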


Reviewed-by: Richard Henderson 

r~



Re: [PATCH v2 08/12] Hexagon (tests/tcg/hexagon) update overflow test

2022-02-10 Thread Richard Henderson

On 2/10/22 13:15, Taylor Simpson wrote:

Add a test that sets USR multiple times in a packet

Signed-off-by: Taylor Simpson
---
  tests/tcg/hexagon/overflow.c | 61 +++-
  1 file changed, 60 insertions(+), 1 deletion(-)


Acked-by: Richard Henderson 

r~



Re: [PATCH v2 07/12] Hexagon (tests/tcg/hexagon) add floating point instructions to usr.c

2022-02-10 Thread Richard Henderson

On 2/10/22 13:15, Taylor Simpson wrote:

Tests to confirm floating point instructions are properly
setting exception bits in USR

Signed-off-by: Taylor Simpson
---
  tests/tcg/hexagon/usr.c | 339 
  1 file changed, 339 insertions(+)


Acked-by: Richard Henderson 

r~



Re: [PATCH v2 06/12] Hexagon (tests/tcg/hexagon) test instructions that might set bits in USR

2022-02-10 Thread Richard Henderson

On 2/10/22 13:15, Taylor Simpson wrote:

+#define CLEAR_USRBITS \
+"r2 = usr\n\t" \
+"r2 = clrbit(r2, #0)\n\t" \
+"r2 = clrbit(r2, #1)\n\t" \
+"r2 = clrbit(r2, #2)\n\t" \
+"r2 = clrbit(r2, #3)\n\t" \
+"r2 = clrbit(r2, #4)\n\t" \
+"r2 = clrbit(r2, #5)\n\t" \
+"usr = r2\n\t"


It's just a test case, so it doesn't really matter, but

r2 = and(r2, #~0x3f)

surely?

Otherwise,
Acked-by: Richard Henderson 


r~



Re: [PATCH v5 03/18] pci: isolated address space for PCI bus

2022-02-10 Thread Jag Raman


> On Feb 10, 2022, at 7:26 PM, Michael S. Tsirkin  wrote:
> 
> On Thu, Feb 10, 2022 at 04:49:33PM -0700, Alex Williamson wrote:
>> On Thu, 10 Feb 2022 18:28:56 -0500
>> "Michael S. Tsirkin"  wrote:
>> 
>>> On Thu, Feb 10, 2022 at 04:17:34PM -0700, Alex Williamson wrote:
 On Thu, 10 Feb 2022 22:23:01 +
 Jag Raman  wrote:
 
>> On Feb 10, 2022, at 3:02 AM, Michael S. Tsirkin  wrote:
>> 
>> On Thu, Feb 10, 2022 at 12:08:27AM +, Jag Raman wrote:
>>> 
>>> Thanks for the explanation, Alex. Thanks to everyone else in the thread 
>>> who
>>> helped to clarify this problem.
>>> 
>>> We have implemented the memory isolation based on the discussion in the
>>> thread. We will send the patches out shortly.
>>> 
>>> Devices such as “name” and “e1000” worked fine. But I’d like to note 
>>> that
>>> the LSI device (TYPE_LSI53C895A) had some problems - it doesn’t seem
>>> to be IOMMU aware. In LSI’s case, the kernel driver is asking the 
>>> device to
>>> read instructions from the CPU VA (lsi_execute_script() -> 
>>> read_dword()),
>>> which is forbidden when IOMMU is enabled. Specifically, the driver is 
>>> asking
>>> the device to access other BAR regions by using the BAR address 
>>> programmed
>>> in the PCI config space. This happens even without vfio-user patches. 
>>> For example,
>>> we could enable IOMMU using “-device intel-iommu” QEMU option and also
>>> adding the following to the kernel command-line: “intel_iommu=on 
>>> iommu=nopt”.
>>> In this case, we could see an IOMMU fault.
>> 
>> So, device accessing its own BAR is different. Basically, these
>> transactions never go on the bus at all, never mind get to the IOMMU.
> 
> Hi Michael,
> 
> In LSI case, I did notice that it went to the IOMMU. The device is 
> reading the BAR
> address as if it was a DMA address.
> 
>> I think it's just used as a handle to address internal device memory.
>> This kind of trick is not universal, but not terribly unusual.
>> 
>> 
>>> Unfortunately, we started off our project with the LSI device. So that 
>>> lead to all the
>>> confusion about what is expected at the server end in-terms of
>>> vectoring/address-translation. It gave an impression as if the request 
>>> was still on
>>> the CPU side of the PCI root complex, but the actual problem was with 
>>> the
>>> device driver itself.
>>> 
>>> I’m wondering how to deal with this problem. Would it be OK if we 
>>> mapped the
>>> device’s BAR into the IOVA, at the same CPU VA programmed in the BAR 
>>> registers?
>>> This would help devices such as LSI to circumvent this problem. One 
>>> problem
>>> with this approach is that it has the potential to collide with another 
>>> legitimate
>>> IOVA address. Kindly share your thought on this.
>>> 
>>> Thank you!
>> 
>> I am not 100% sure what do you plan to do but it sounds fine since even
>> if it collides, with traditional PCI device must never initiate cycles   
>>  
> 
> OK sounds good, I’ll create a mapping of the device BARs in the IOVA.  
 
 I don't think this is correct.  Look for instance at ACPI _TRA support
 where a system can specify a translation offset such that, for example,
 a CPU access to a device is required to add the provided offset to the
 bus address of the device.  A system using this could have multiple
 root bridges, where each is given the same, overlapping MMIO aperture.  
 From the processor perspective, each MMIO range is unique and possibly
 none of those devices have a zero _TRA, there could be system memory at
 the equivalent flat memory address.  
>>> 
>>> I am guessing there are reasons to have these in acpi besides firmware
>>> vendors wanting to find corner cases in device implementations though
>>> :). E.g. it's possible something else is tweaking DMA in similar ways. I
>>> can't say for sure and I wonder why do we care as long as QEMU does not
>>> have _TRA.
>> 
>> How many complaints do we get about running out of I/O port space on
>> q35 because we allow an arbitrary number of root ports?  What if we
>> used _TRA to provide the full I/O port range per root port?  32-bit
>> MMIO could be duplicated as well.
> 
> It's an interesting idea. To clarify what I said, I suspect some devices
> are broken in presence of translating bridges unless DMA
> is also translated to match.
> 
> I agree it's a mess though, in that some devices when given their own
> BAR to DMA to will probably just satisfy the access from internal
> memory, while others will ignore that and send it up as DMA
> and both types are probably out there in the field.
> 
> 
 So if the transaction actually hits this bus, which I think is what
 making use of the device AddressSpace implies, I don't think it can
 assume that it's simply reflected back at itself.

Re: [PATCH v2 04/12] Hexagon (target/hexagon) properly handle SNaN in dfmin/dfmax/sfmin/sfmax

2022-02-10 Thread Richard Henderson

On 2/10/22 13:15, Taylor Simpson wrote:

The float??_minnum implementation differs from Hexagon for SNaN:
it returns NaN, but Hexagon returns the other input.  So, we add
checks for NaN before calling it.

test cases added in a subsequent patch to more extensively test USR bits

Signed-off-by: Taylor Simpson 


This appears to be the same as the IEEE 754-2019 minimumNumber (as opposed to the earlier 
754-2008 minNum), which a recent RISC-V revision adopted.  We added support for that 
directly in softfloat: float32_minimum_number et al.
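
A sketch of the simplification this enables (assuming QEMU's "fpu/softfloat.h"; not the actual Hexagon helper):

    /* 754-2019 minimumNumber: if exactly one operand is a NaN -- even a
     * signaling one -- the other operand is returned, so the explicit
     * NaN pre-checks become unnecessary. */
    static float32 sfmin_sketch(float32 a, float32 b, float_status *st)
    {
        return float32_minimum_number(a, b, st);
    }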



r~



Re: [PATCH v2 05/12] Hexagon (target/hexagon) properly handle denorm in arch_sf_recip_common

2022-02-10 Thread Richard Henderson

On 2/10/22 13:15, Taylor Simpson wrote:

The arch_sf_recip_common function was calling float32_getexp which
adjusts for denorm, but we actually need the raw exponent bits.

This function is called from 3 instructions
 sfrecipa
 sffixupn
 sffixupd

Test cases added to tests/tcg/hexagon/fpstuff.c

Signed-off-by: Taylor Simpson
---
  target/hexagon/fma_emu.h|  6 -
  target/hexagon/arch.c   |  6 ++---
  tests/tcg/hexagon/fpstuff.c | 44 ++---
  3 files changed, 49 insertions(+), 7 deletions(-)
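
For reference, the raw-exponent extraction amounts to the following sketch (bit layout of an IEEE-754 single; illustrative, not the patch itself):

    #include <stdint.h>

    /* Raw biased exponent, with no denormal adjustment: reads as 0 for
     * denormals instead of the adjusted value float32_getexp() returns. */
    static int f32_getexp_raw(uint32_t f32bits)
    {
        return (f32bits >> 23) & 0xff;
    }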


Reviewed-by: Richard Henderson 

r~



Re: [PATCH v2 03/12] Hexagon (target/hexagon) properly set FPINVF bit in sfcmp.uo and dfcmp.uo

2022-02-10 Thread Richard Henderson

On 2/10/22 13:15, Taylor Simpson wrote:

Instead of checking for NaN arguments, use float??_unordered_quiet

test cases added in a subsequent patch to more extensively test USR bits

Signed-off-by: Taylor Simpson
---
  target/hexagon/op_helper.c | 6 ++
  1 file changed, 2 insertions(+), 4 deletions(-)


Reviewed-by: Richard Henderson 

r~



Re: [PATCH v2 02/12] Hexagon HVX (target/hexagon) fix bug in HVX saturate instructions

2022-02-10 Thread Richard Henderson

On 2/10/22 13:15, Taylor Simpson wrote:

Two tests added to tests/tcg/hexagon/hvx_misc.c
 v21.uw = vadd(v11.uw, v10.uw):sat
 v25:24.uw = vsub(v17:16.uw, v27:26.uw):sat

Signed-off-by: Taylor Simpson
---
  target/hexagon/macros.h  |  4 +-
  tests/tcg/hexagon/hvx_misc.c | 71 +++-
  2 files changed, 72 insertions(+), 3 deletions(-)


Reviewed-by: Richard Henderson 


r~



Re: [PATCH v2 01/12] Hexagon (target/hexagon) fix bug in circular addressing

2022-02-10 Thread Richard Henderson

On 2/10/22 13:15, Taylor Simpson wrote:

From: Michael Lambert 

Versions V3 and earlier should treat the "K_const" and "length" values
as unsigned.

Modified circ_test_v3() in tests/tcg/hexagon/circ.c to reproduce the bug

Signed-off-by: Michael Lambert 
Signed-off-by: Taylor Simpson 
---
  target/hexagon/op_helper.c | 6 +++---
  tests/tcg/hexagon/circ.c   | 5 +++--
  2 files changed, 6 insertions(+), 5 deletions(-)
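
A sketch of the bug class being fixed -- the wrap comparison has to be done in unsigned arithmetic (names are illustrative, not the actual helper):

    #include <stdint.h>

    /* If 'length' were sign-extended and compared as signed, offsets with
     * the top bit set would wrap incorrectly; unsigned types avoid that. */
    static uint32_t circ_wrap(uint32_t offset, uint32_t length)
    {
        return offset >= length ? offset - length : offset;
    }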


Reviewed-by: Richard Henderson 


r~



Re: [PATCH v5 03/18] pci: isolated address space for PCI bus

2022-02-10 Thread Michael S. Tsirkin
On Thu, Feb 10, 2022 at 04:49:33PM -0700, Alex Williamson wrote:
> On Thu, 10 Feb 2022 18:28:56 -0500
> "Michael S. Tsirkin"  wrote:
> 
> > On Thu, Feb 10, 2022 at 04:17:34PM -0700, Alex Williamson wrote:
> > > On Thu, 10 Feb 2022 22:23:01 +
> > > Jag Raman  wrote:
> > >   
> > > > > On Feb 10, 2022, at 3:02 AM, Michael S. Tsirkin  
> > > > > wrote:
> > > > > 
> > > > > On Thu, Feb 10, 2022 at 12:08:27AM +, Jag Raman wrote:
> > > > >> 
> > > > >> Thanks for the explanation, Alex. Thanks to everyone else in the 
> > > > >> thread who
> > > > >> helped to clarify this problem.
> > > > >> 
> > > > >> We have implemented the memory isolation based on the discussion in 
> > > > >> the
> > > > >> thread. We will send the patches out shortly.
> > > > >> 
> > > > >> Devices such as “name” and “e1000” worked fine. But I’d like to note 
> > > > >> that
> > > > >> the LSI device (TYPE_LSI53C895A) had some problems - it doesn’t seem
> > > > >> to be IOMMU aware. In LSI’s case, the kernel driver is asking the 
> > > > >> device to
> > > > >> read instructions from the CPU VA (lsi_execute_script() -> 
> > > > >> read_dword()),
> > > > >> which is forbidden when IOMMU is enabled. Specifically, the driver 
> > > > >> is asking
> > > > >> the device to access other BAR regions by using the BAR address 
> > > > >> programmed
> > > > >> in the PCI config space. This happens even without vfio-user 
> > > > >> patches. For example,
> > > > >> we could enable IOMMU using “-device intel-iommu” QEMU option and 
> > > > >> also
> > > > >> adding the following to the kernel command-line: “intel_iommu=on 
> > > > >> iommu=nopt”.
> > > > >> In this case, we could see an IOMMU fault.
> > > > > 
> > > > > So, device accessing its own BAR is different. Basically, these
> > > > > transactions never go on the bus at all, never mind get to the IOMMU. 
> > > > >
> > > > 
> > > > Hi Michael,
> > > > 
> > > > In LSI case, I did notice that it went to the IOMMU. The device is 
> > > > reading the BAR
> > > > address as if it was a DMA address.
> > > >   
> > > > > I think it's just used as a handle to address internal device memory.
> > > > > This kind of trick is not universal, but not terribly unusual.
> > > > > 
> > > > > 
> > > > >> Unfortunately, we started off our project with the LSI device. So 
> > > > >> that lead to all the
> > > > >> confusion about what is expected at the server end in-terms of
> > > > >> vectoring/address-translation. It gave an impression as if the 
> > > > >> request was still on
> > > > >> the CPU side of the PCI root complex, but the actual problem was 
> > > > >> with the
> > > > >> device driver itself.
> > > > >> 
> > > > >> I’m wondering how to deal with this problem. Would it be OK if we 
> > > > >> mapped the
> > > > >> device’s BAR into the IOVA, at the same CPU VA programmed in the BAR 
> > > > >> registers?
> > > > >> This would help devices such as LSI to circumvent this problem. One 
> > > > >> problem
> > > > >> with this approach is that it has the potential to collide with 
> > > > >> another legitimate
> > > > >> IOVA address. Kindly share your thought on this.
> > > > >> 
> > > > >> Thank you!
> > > > > 
> > > > > I am not 100% sure what do you plan to do but it sounds fine since 
> > > > > even
> > > > > if it collides, with traditional PCI device must never initiate 
> > > > > cycles
> > > > 
> > > > OK sounds good, I’ll create a mapping of the device BARs in the IOVA.  
> > > 
> > > I don't think this is correct.  Look for instance at ACPI _TRA support
> > > where a system can specify a translation offset such that, for example,
> > > a CPU access to a device is required to add the provided offset to the
> > > bus address of the device.  A system using this could have multiple
> > > root bridges, where each is given the same, overlapping MMIO aperture.  
> > > From the processor perspective, each MMIO range is unique and possibly
> > > none of those devices have a zero _TRA, there could be system memory at
> > > the equivalent flat memory address.  
> > 
> > I am guessing there are reasons to have these in acpi besides firmware
> > vendors wanting to find corner cases in device implementations though
> > :). E.g. it's possible something else is tweaking DMA in similar ways. I
> > can't say for sure and I wonder why do we care as long as QEMU does not
> > have _TRA.
> 
> How many complaints do we get about running out of I/O port space on
> q35 because we allow an arbitrary number of root ports?  What if we
> used _TRA to provide the full I/O port range per root port?  32-bit
> MMIO could be duplicated as well.

It's an interesting idea. To clarify what I said, I suspect some devices
are broken in presence of translating bridges unless DMA
is also translated to match.

I agree it's a mess though, in that some devices when given their own
BAR to DMA to will probably just satisfy the access from internal
memory, while others will ignore that and send it up as DMA
and both types are probably out there in the field.

Re: [PATCH v5 03/18] pci: isolated address space for PCI bus

2022-02-10 Thread Jag Raman


> On Feb 10, 2022, at 6:17 PM, Alex Williamson  
> wrote:
> 
> On Thu, 10 Feb 2022 22:23:01 +
> Jag Raman  wrote:
> 
>>> On Feb 10, 2022, at 3:02 AM, Michael S. Tsirkin  wrote:
>>> 
>>> On Thu, Feb 10, 2022 at 12:08:27AM +, Jag Raman wrote:  
 
 Thanks for the explanation, Alex. Thanks to everyone else in the thread who
 helped to clarify this problem.
 
 We have implemented the memory isolation based on the discussion in the
 thread. We will send the patches out shortly.
 
 Devices such as “name” and “e1000” worked fine. But I’d like to note that
 the LSI device (TYPE_LSI53C895A) had some problems - it doesn’t seem
 to be IOMMU aware. In LSI’s case, the kernel driver is asking the device to
 read instructions from the CPU VA (lsi_execute_script() -> read_dword()),
 which is forbidden when IOMMU is enabled. Specifically, the driver is 
 asking
 the device to access other BAR regions by using the BAR address programmed
 in the PCI config space. This happens even without vfio-user patches. For 
 example,
 we could enable IOMMU using “-device intel-iommu” QEMU option and also
 adding the following to the kernel command-line: “intel_iommu=on 
 iommu=nopt”.
 In this case, we could see an IOMMU fault.  
>>> 
>>> So, device accessing its own BAR is different. Basically, these
>>> transactions never go on the bus at all, never mind get to the IOMMU.  
>> 
>> Hi Michael,
>> 
>> In LSI case, I did notice that it went to the IOMMU. The device is reading 
>> the BAR
>> address as if it was a DMA address.
>> 
>>> I think it's just used as a handle to address internal device memory.
>>> This kind of trick is not universal, but not terribly unusual.
>>> 
>>> 
 Unfortunately, we started off our project with the LSI device. So that 
 lead to all the
 confusion about what is expected at the server end in-terms of
 vectoring/address-translation. It gave an impression as if the request was 
 still on
 the CPU side of the PCI root complex, but the actual problem was with the
 device driver itself.
 
 I’m wondering how to deal with this problem. Would it be OK if we mapped 
 the
 device’s BAR into the IOVA, at the same CPU VA programmed in the BAR 
 registers?
 This would help devices such as LSI to circumvent this problem. One problem
 with this approach is that it has the potential to collide with another 
 legitimate
 IOVA address. Kindly share your thought on this.
 
 Thank you!  
>>> 
>>> I am not 100% sure what do you plan to do but it sounds fine since even
>>> if it collides, with traditional PCI device must never initiate cycles  
>> 
>> OK sounds good, I’ll create a mapping of the device BARs in the IOVA.
> 
> I don't think this is correct.  Look for instance at ACPI _TRA support
> where a system can specify a translation offset such that, for example,
> a CPU access to a device is required to add the provided offset to the
> bus address of the device.  A system using this could have multiple
> root bridges, where each is given the same, overlapping MMIO aperture.
> From the processor perspective, each MMIO range is unique and possibly
> none of those devices have a zero _TRA, there could be system memory at
> the equivalent flat memory address.
> 
> So if the transaction actually hits this bus, which I think is what
> making use of the device AddressSpace implies, I don't think it can
> assume that it's simply reflected back at itself.  Conventional PCI and
> PCI Express may be software compatible, but there's a reason we don't
> see IOMMUs that provide both translation and isolation in conventional
> topologies.
> 
> Is this more a bug in the LSI device emulation model?  For instance in
> vfio-pci, if I want to access an offset into a BAR from within QEMU, I
> don't care what address is programmed into that BAR, I perform an
> access relative to the vfio file descriptor region representing that
> BAR space.  I'd expect that any viable device emulation model does the
> same, an access to device memory uses an offset from an internal
> resource, irrespective of the BAR address.
> 
> It would seem strange if the driver is actually programming the device
> to DMA to itself and if that's actually happening, I'd wonder if this

It does look like the driver is actually programming the device to DMA to 
itself.

The driver first programs the DSP (DMA Scripts Pointer) register with the BAR
address. It does so by performing a series of MMIO writes (lsi_mmio_write())
to offsets 0x2C - 0x2F. Immediately after programming this register, the device
fetches some instructions located at the programmed address.
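
For illustration, the access pattern described above looks roughly like this (a hedged sketch of the guest-driver side; the DSP offset is from the LSI53C895A programming model, and the code is not the actual driver):

    #include <stdint.h>

    #define LSI_DSP 0x2c   /* DMA SCRIPTS Pointer register, offsets 0x2c-0x2f */

    static void start_scripts(volatile uint8_t *mmio, uint32_t script_addr)
    {
        /* Writing the last byte of DSP kicks off instruction fetch from
         * script_addr, which here is the device's own BAR address. */
        for (int i = 0; i < 4; i++) {
            mmio[LSI_DSP + i] = (script_addr >> (8 * i)) & 0xff;
        }
    }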

Thank you!
--
Jag

> driver is actually compatible with an IOMMU on bare metal.
> 
>>> within their own BAR range, and PCIe is software-compatible with PCI. So
>>> devices won't be able to access this IOVA even if it was programmed in
>>> the IOMMU.
>>> 
>>> As was mentioned elsewhere on this thread, devices accessing each
>>> other's BAR is a different matter.

Re: [PATCH v5 03/18] pci: isolated address space for PCI bus

2022-02-10 Thread Alex Williamson
On Thu, 10 Feb 2022 18:28:56 -0500
"Michael S. Tsirkin"  wrote:

> On Thu, Feb 10, 2022 at 04:17:34PM -0700, Alex Williamson wrote:
> > On Thu, 10 Feb 2022 22:23:01 +
> > Jag Raman  wrote:
> >   
> > > > On Feb 10, 2022, at 3:02 AM, Michael S. Tsirkin  wrote:
> > > > 
> > > > On Thu, Feb 10, 2022 at 12:08:27AM +, Jag Raman wrote:
> > > >> 
> > > >> Thanks for the explanation, Alex. Thanks to everyone else in the 
> > > >> thread who
> > > >> helped to clarify this problem.
> > > >> 
> > > >> We have implemented the memory isolation based on the discussion in the
> > > >> thread. We will send the patches out shortly.
> > > >> 
> > > >> Devices such as “name” and “e1000” worked fine. But I’d like to note 
> > > >> that
> > > >> the LSI device (TYPE_LSI53C895A) had some problems - it doesn’t seem
> > > >> to be IOMMU aware. In LSI’s case, the kernel driver is asking the 
> > > >> device to
> > > >> read instructions from the CPU VA (lsi_execute_script() -> 
> > > >> read_dword()),
> > > >> which is forbidden when IOMMU is enabled. Specifically, the driver is 
> > > >> asking
> > > >> the device to access other BAR regions by using the BAR address 
> > > >> programmed
> > > >> in the PCI config space. This happens even without vfio-user patches. 
> > > >> For example,
> > > >> we could enable IOMMU using “-device intel-iommu” QEMU option and also
> > > >> adding the following to the kernel command-line: “intel_iommu=on 
> > > >> iommu=nopt”.
> > > >> In this case, we could see an IOMMU fault.
> > > > 
> > > > So, device accessing its own BAR is different. Basically, these
> > > > transactions never go on the bus at all, never mind get to the IOMMU.   
> > > >  
> > > 
> > > Hi Michael,
> > > 
> > > In LSI case, I did notice that it went to the IOMMU. The device is 
> > > reading the BAR
> > > address as if it was a DMA address.
> > >   
> > > > I think it's just used as a handle to address internal device memory.
> > > > This kind of trick is not universal, but not terribly unusual.
> > > > 
> > > > 
> > > >> Unfortunately, we started off our project with the LSI device. So that 
> > > >> lead to all the
> > > >> confusion about what is expected at the server end in-terms of
> > > >> vectoring/address-translation. It gave an impression as if the request 
> > > >> was still on
> > > >> the CPU side of the PCI root complex, but the actual problem was with 
> > > >> the
> > > >> device driver itself.
> > > >> 
> > > >> I’m wondering how to deal with this problem. Would it be OK if we 
> > > >> mapped the
> > > >> device’s BAR into the IOVA, at the same CPU VA programmed in the BAR 
> > > >> registers?
> > > >> This would help devices such as LSI to circumvent this problem. One 
> > > >> problem
> > > >> with this approach is that it has the potential to collide with 
> > > >> another legitimate
> > > >> IOVA address. Kindly share your thought on this.
> > > >> 
> > > >> Thank you!
> > > > 
> > > > I am not 100% sure what do you plan to do but it sounds fine since even
> > > > if it collides, with traditional PCI device must never initiate cycles  
> > > >   
> > > 
> > > OK sounds good, I’ll create a mapping of the device BARs in the IOVA.  
> > 
> > I don't think this is correct.  Look for instance at ACPI _TRA support
> > where a system can specify a translation offset such that, for example,
> > a CPU access to a device is required to add the provided offset to the
> > bus address of the device.  A system using this could have multiple
> > root bridges, where each is given the same, overlapping MMIO aperture.  
> > From the processor perspective, each MMIO range is unique and possibly
> > none of those devices have a zero _TRA, there could be system memory at
> > the equivalent flat memory address.  
> 
> I am guessing there are reasons to have these in acpi besides firmware
> vendors wanting to find corner cases in device implementations though
> :). E.g. it's possible something else is tweaking DMA in similar ways. I
> can't say for sure and I wonder why do we care as long as QEMU does not
> have _TRA.

How many complaints do we get about running out of I/O port space on
q35 because we allow an arbitrary number of root ports?  What if we
used _TRA to provide the full I/O port range per root port?  32-bit
MMIO could be duplicated as well.

> > So if the transaction actually hits this bus, which I think is what
> > making use of the device AddressSpace implies, I don't think it can
> > assume that it's simply reflected back at itself.  Conventional PCI and
> > PCI Express may be software compatible, but there's a reason we don't
> > see IOMMUs that provide both translation and isolation in conventional
> > topologies.
> > 
> > Is this more a bug in the LSI device emulation model?  For instance in
> > vfio-pci, if I want to access an offset into a BAR from within QEMU, I
> > don't care what address is programmed into that BAR, I perform an
> > access relative to the vfio file descriptor region representing that
> > BAR space.

Re: [PATCH v5 03/18] pci: isolated address space for PCI bus

2022-02-10 Thread Jag Raman


> On Feb 10, 2022, at 5:53 PM, Michael S. Tsirkin  wrote:
> 
> On Thu, Feb 10, 2022 at 10:23:01PM +, Jag Raman wrote:
>> 
>> 
>>> On Feb 10, 2022, at 3:02 AM, Michael S. Tsirkin  wrote:
>>> 
>>> On Thu, Feb 10, 2022 at 12:08:27AM +, Jag Raman wrote:
 
 
> On Feb 2, 2022, at 12:34 AM, Alex Williamson  
> wrote:
> 
> On Wed, 2 Feb 2022 01:13:22 +
> Jag Raman  wrote:
> 
>>> On Feb 1, 2022, at 5:47 PM, Alex Williamson 
>>>  wrote:
>>> 
>>> On Tue, 1 Feb 2022 21:24:08 +
>>> Jag Raman  wrote:
>>> 
> On Feb 1, 2022, at 10:24 AM, Alex Williamson 
>  wrote:
> 
> On Tue, 1 Feb 2022 09:30:35 +
> Stefan Hajnoczi  wrote:
> 
>> On Mon, Jan 31, 2022 at 09:16:23AM -0700, Alex Williamson wrote:
>>> On Fri, 28 Jan 2022 09:18:08 +
>>> Stefan Hajnoczi  wrote:
>>> 
 On Thu, Jan 27, 2022 at 02:22:53PM -0700, Alex Williamson wrote:   

> If the goal here is to restrict DMA between devices, ie. 
> peer-to-peer
> (p2p), why are we trying to re-invent what an IOMMU already does? 
>
 
 The issue Dave raised is that vfio-user servers run in separate
 processes from QEMU with shared memory access to RAM but no direct
 access to non-RAM MemoryRegions. The virtiofs DAX Window BAR is one
 example of a non-RAM MemoryRegion that can be the source/target of 
 DMA
 requests.
 
 I don't think IOMMUs solve this problem but luckily the vfio-user
 protocol already has messages that vfio-user servers can use as a
 fallback when DMA cannot be completed through the shared memory RAM
 accesses.
 
> In
> fact, it seems like an IOMMU does this better in providing an IOVA
> address space per BDF.  Is the dynamic mapping overhead too much? 
>  What
> physical hardware properties or specifications could we leverage 
> to
> restrict p2p mappings to a device?  Should it be governed by 
> machine
> type to provide consistency between devices?  Should each 
> "isolated"
> bus be in a separate root complex?  Thanks,
 
 There is a separate issue in this patch series regarding isolating 
 the
 address space where BAR accesses are made (i.e. the global
 address_space_memory/io). When one process hosts multiple vfio-user
 server instances (e.g. a software-defined network switch with 
 multiple
 ethernet devices) then each instance needs isolated memory and io 
 address
 spaces so that vfio-user clients don't cause collisions when they 
 map
 BARs to the same address.
 
 I think the separate root complex idea is a good solution. This
 patch series takes a different approach by adding the concept of
 isolated address spaces into hw/pci/.  
>>> 
>>> This all still seems pretty sketchy, BARs cannot overlap within the
>>> same vCPU address space, perhaps with the exception of when they're
>>> being sized, but DMA should be disabled during sizing.
>>> 
>>> Devices within the same VM context with identical BARs would need to
>>> operate in different address spaces.  For example a translation 
>>> offset
>>> in the vCPU address space would allow unique addressing to the 
>>> devices,
>>> perhaps using the translation offset bits to address a root complex 
>>> and
>>> masking those bits for downstream transactions.
>>> 
>>> In general, the device simply operates in an address space, ie. an
>>> IOVA.  When a mapping is made within that address space, we perform 
>>> a
>>> translation as necessary to generate a guest physical address.  The
>>> IOVA itself is only meaningful within the context of the address 
>>> space,
>>> there is no requirement or expectation for it to be globally unique.
>>> 
>>> If the vfio-user server is making some sort of requirement that 
>>> IOVAs
>>> are unique across all devices, that seems very, very wrong.  
>>> Thanks,  
>> 
>> Yes, BARs and IOVAs don't need to be unique across all devices.
>> 
>> The issue is that there can be as many guest physical address spaces 
>> as
>> there are vfio-user clients connected, so per-client isolated address
>> spaces are required. This patch series has a solution to that problem
>> with the new pci_isol_as_mem/io() API.

Re: [PATCH v5 03/18] pci: isolated address space for PCI bus

2022-02-10 Thread Michael S. Tsirkin
On Thu, Feb 10, 2022 at 04:17:34PM -0700, Alex Williamson wrote:
> On Thu, 10 Feb 2022 22:23:01 +
> Jag Raman  wrote:
> 
> > > On Feb 10, 2022, at 3:02 AM, Michael S. Tsirkin  wrote:
> > > 
> > > On Thu, Feb 10, 2022 at 12:08:27AM +, Jag Raman wrote:  
> > >> 
> > >> Thanks for the explanation, Alex. Thanks to everyone else in the thread 
> > >> who
> > >> helped to clarify this problem.
> > >> 
> > >> We have implemented the memory isolation based on the discussion in the
> > >> thread. We will send the patches out shortly.
> > >> 
> > >> Devices such as “name” and “e1000” worked fine. But I’d like to note that
> > >> the LSI device (TYPE_LSI53C895A) had some problems - it doesn’t seem
> > >> to be IOMMU aware. In LSI’s case, the kernel driver is asking the device 
> > >> to
> > >> read instructions from the CPU VA (lsi_execute_script() -> read_dword()),
> > >> which is forbidden when IOMMU is enabled. Specifically, the driver is 
> > >> asking
> > >> the device to access other BAR regions by using the BAR address 
> > >> programmed
> > >> in the PCI config space. This happens even without vfio-user patches. 
> > >> For example,
> > >> we could enable IOMMU using “-device intel-iommu” QEMU option and also
> > >> adding the following to the kernel command-line: “intel_iommu=on 
> > >> iommu=nopt”.
> > >> In this case, we could see an IOMMU fault.  
> > > 
> > > So, device accessing its own BAR is different. Basically, these
> > > transactions never go on the bus at all, never mind get to the IOMMU.  
> > 
> > Hi Michael,
> > 
> > In LSI case, I did notice that it went to the IOMMU. The device is reading 
> > the BAR
> > address as if it was a DMA address.
> > 
> > > I think it's just used as a handle to address internal device memory.
> > > This kind of trick is not universal, but not terribly unusual.
> > > 
> > >   
> > >> Unfortunately, we started off our project with the LSI device. So that 
> > >> lead to all the
> > >> confusion about what is expected at the server end in-terms of
> > >> vectoring/address-translation. It gave an impression as if the request 
> > >> was still on
> > >> the CPU side of the PCI root complex, but the actual problem was with the
> > >> device driver itself.
> > >> 
> > >> I’m wondering how to deal with this problem. Would it be OK if we mapped 
> > >> the
> > >> device’s BAR into the IOVA, at the same CPU VA programmed in the BAR 
> > >> registers?
> > >> This would help devices such as LSI to circumvent this problem. One 
> > >> problem
> > >> with this approach is that it has the potential to collide with another 
> > >> legitimate
> > >> IOVA address. Kindly share your thought on this.
> > >> 
> > >> Thank you!  
> > > 
> > > I am not 100% sure what do you plan to do but it sounds fine since even
> > > if it collides, with traditional PCI device must never initiate cycles  
> > 
> > OK sounds good, I’ll create a mapping of the device BARs in the IOVA.
> 
> I don't think this is correct.  Look for instance at ACPI _TRA support
> where a system can specify a translation offset such that, for example,
> a CPU access to a device is required to add the provided offset to the
> bus address of the device.  A system using this could have multiple
> root bridges, where each is given the same, overlapping MMIO aperture.
> From the processor perspective, each MMIO range is unique and possibly
> none of those devices have a zero _TRA, there could be system memory at
> the equivalent flat memory address.

I am guessing there are reasons to have these in acpi besides firmware
vendors wanting to find corner cases in device implementations though
:). E.g. it's possible something else is tweaking DMA in similar ways. I
can't say for sure and I wonder why do we care as long as QEMU does not
have _TRA.


> So if the transaction actually hits this bus, which I think is what
> making use of the device AddressSpace implies, I don't think it can
> assume that it's simply reflected back at itself.  Conventional PCI and
> PCI Express may be software compatible, but there's a reason we don't
> see IOMMUs that provide both translation and isolation in conventional
> topologies.
> 
> Is this more a bug in the LSI device emulation model?  For instance in
> vfio-pci, if I want to access an offset into a BAR from within QEMU, I
> don't care what address is programmed into that BAR, I perform an
> access relative to the vfio file descriptor region representing that
> BAR space.  I'd expect that any viable device emulation model does the
> same, an access to device memory uses an offset from an internal
> resource, irrespective of the BAR address.

However, using the BAR seems like a reasonable shortcut, allowing the
device to use the same 64-bit address to refer to system
and device RAM interchangeably.
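
A hedged sketch of that shortcut as a device model might implement it (DeviceSketch and its fields are invented for illustration; pci_dma_read() is the normal bus path and assumes "hw/pci/pci.h"):

    typedef struct DeviceSketch {
        PCIDevice pdev;
        uint64_t bar_base, bar_size;
        uint8_t *internal_mem;
    } DeviceSketch;

    static void dev_dma_read(DeviceSketch *d, uint64_t addr, void *buf, size_t len)
    {
        if (addr >= d->bar_base && addr + len <= d->bar_base + d->bar_size) {
            /* The address falls inside our own BAR: satisfy the access
             * from internal memory without going out on the bus. */
            memcpy(buf, d->internal_mem + (addr - d->bar_base), len);
        } else {
            pci_dma_read(&d->pdev, addr, buf, len);
        }
    }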

> It would seem strange if the driver is actually programming the device
> to DMA to itself and if that's actually happening, I'd wonder if this
> driver is actually compatible with an IOMMU on bare metal.

Re: [PATCH 1/9] accel/tcg: Add missing 'tcg/tcg.h' header

2022-02-10 Thread Richard Henderson

On 2/10/22 10:00, Philippe Mathieu-Daudé wrote:

Signed-off-by: Philippe Mathieu-Daudé 
---
  accel/tcg/tcg-accel-ops-icount.c | 1 +
  accel/tcg/tcg-accel-ops-mttcg.c  | 1 +
  accel/tcg/tcg-accel-ops-rr.c | 1 +
  accel/tcg/tcg-accel-ops.c| 1 +
  4 files changed, 4 insertions(+)


What exactly are these files using from tcg.h?
I briefly scanned tcg-accel-ops-icount.c and didn't see anything.


r~




diff --git a/accel/tcg/tcg-accel-ops-icount.c b/accel/tcg/tcg-accel-ops-icount.c
index bdaf2c943b..379a9d44f4 100644
--- a/accel/tcg/tcg-accel-ops-icount.c
+++ b/accel/tcg/tcg-accel-ops-icount.c
@@ -31,6 +31,7 @@
  #include "qemu/main-loop.h"
  #include "qemu/guest-random.h"
  #include "exec/exec-all.h"
+#include "tcg/tcg.h"
  
  #include "tcg-accel-ops.h"

  #include "tcg-accel-ops-icount.h"
diff --git a/accel/tcg/tcg-accel-ops-mttcg.c b/accel/tcg/tcg-accel-ops-mttcg.c
index dc421c8fd7..de7dcb02e6 100644
--- a/accel/tcg/tcg-accel-ops-mttcg.c
+++ b/accel/tcg/tcg-accel-ops-mttcg.c
@@ -33,6 +33,7 @@
  #include "qemu/guest-random.h"
  #include "exec/exec-all.h"
  #include "hw/boards.h"
+#include "tcg/tcg.h"
  
  #include "tcg-accel-ops.h"

  #include "tcg-accel-ops-mttcg.h"
diff --git a/accel/tcg/tcg-accel-ops-rr.c b/accel/tcg/tcg-accel-ops-rr.c
index a805fb6bdd..889d0882a2 100644
--- a/accel/tcg/tcg-accel-ops-rr.c
+++ b/accel/tcg/tcg-accel-ops-rr.c
@@ -32,6 +32,7 @@
  #include "qemu/notify.h"
  #include "qemu/guest-random.h"
  #include "exec/exec-all.h"
+#include "tcg/tcg.h"
  
  #include "tcg-accel-ops.h"

  #include "tcg-accel-ops-rr.h"
diff --git a/accel/tcg/tcg-accel-ops.c b/accel/tcg/tcg-accel-ops.c
index ea7dcad674..58e4b09043 100644
--- a/accel/tcg/tcg-accel-ops.c
+++ b/accel/tcg/tcg-accel-ops.c
@@ -33,6 +33,7 @@
  #include "qemu/main-loop.h"
  #include "qemu/guest-random.h"
  #include "exec/exec-all.h"
+#include "tcg/tcg.h"
  
  #include "tcg-accel-ops.h"

  #include "tcg-accel-ops-mttcg.h"





Re: [PATCH 9/9] user: Share preexit_cleanup() with linux and bsd implementations

2022-02-10 Thread Richard Henderson

On 2/10/22 10:00, Philippe Mathieu-Daudé wrote:

preexit_cleanup() is not Linux specific, move it to common-user/.

Signed-off-by: Philippe Mathieu-Daudé 
---
  {linux-user => common-user}/exit.c | 0
  common-user/meson.build| 1 +
  linux-user/meson.build | 1 -
  3 files changed, 1 insertion(+), 1 deletion(-)
  rename {linux-user => common-user}/exit.c (100%)


Reviewed-by: Richard Henderson 

Of course, the next step is to use the function (cc Warner).


r~



diff --git a/linux-user/exit.c b/common-user/exit.c
similarity index 100%
rename from linux-user/exit.c
rename to common-user/exit.c
diff --git a/common-user/meson.build b/common-user/meson.build
index 26212dda5c..7204f8bd61 100644
--- a/common-user/meson.build
+++ b/common-user/meson.build
@@ -1,6 +1,7 @@
  common_user_inc += include_directories('host/' / host_arch)
  
  user_ss.add(files(

+  'exit.c',
'safe-syscall.S',
'safe-syscall-error.c',
  ))
diff --git a/linux-user/meson.build b/linux-user/meson.build
index de4320af05..25756a2518 100644
--- a/linux-user/meson.build
+++ b/linux-user/meson.build
@@ -9,7 +9,6 @@ common_user_inc += include_directories('include')
  
  linux_user_ss.add(files(

'elfload.c',
-  'exit.c',
'fd-trans.c',
'linuxload.c',
'main.c',





Re: [PATCH v5 03/18] pci: isolated address space for PCI bus

2022-02-10 Thread Alex Williamson
On Thu, 10 Feb 2022 22:23:01 +
Jag Raman  wrote:

> > On Feb 10, 2022, at 3:02 AM, Michael S. Tsirkin  wrote:
> > 
> > On Thu, Feb 10, 2022 at 12:08:27AM +, Jag Raman wrote:  
> >> 
> >> Thanks for the explanation, Alex. Thanks to everyone else in the thread who
> >> helped to clarify this problem.
> >> 
> >> We have implemented the memory isolation based on the discussion in the
> >> thread. We will send the patches out shortly.
> >> 
> >> Devices such as “name” and “e1000” worked fine. But I’d like to note that
> >> the LSI device (TYPE_LSI53C895A) had some problems - it doesn’t seem
> >> to be IOMMU aware. In LSI’s case, the kernel driver is asking the device to
> >> read instructions from the CPU VA (lsi_execute_script() -> read_dword()),
> >> which is forbidden when IOMMU is enabled. Specifically, the driver is 
> >> asking
> >> the device to access other BAR regions by using the BAR address programmed
> >> in the PCI config space. This happens even without vfio-user patches. For 
> >> example,
> >> we could enable IOMMU using “-device intel-iommu” QEMU option and also
> >> adding the following to the kernel command-line: “intel_iommu=on 
> >> iommu=nopt”.
> >> In this case, we could see an IOMMU fault.  
> > 
> > So, device accessing its own BAR is different. Basically, these
> > transactions never go on the bus at all, never mind get to the IOMMU.  
> 
> Hi Michael,
> 
> In LSI case, I did notice that it went to the IOMMU. The device is reading 
> the BAR
> address as if it was a DMA address.
> 
> > I think it's just used as a handle to address internal device memory.
> > This kind of trick is not universal, but not terribly unusual.
> > 
> >   
> >> Unfortunately, we started off our project with the LSI device. So that 
> >> lead to all the
> >> confusion about what is expected at the server end in-terms of
> >> vectoring/address-translation. It gave an impression as if the request was 
> >> still on
> >> the CPU side of the PCI root complex, but the actual problem was with the
> >> device driver itself.
> >> 
> >> I’m wondering how to deal with this problem. Would it be OK if we mapped 
> >> the
> >> device’s BAR into the IOVA, at the same CPU VA programmed in the BAR 
> >> registers?
> >> This would help devices such as LSI to circumvent this problem. One problem
> >> with this approach is that it has the potential to collide with another 
> >> legitimate
> >> IOVA address. Kindly share your thought on this.
> >> 
> >> Thank you!  
> > 
> > I am not 100% sure what do you plan to do but it sounds fine since even
> > if it collides, with traditional PCI device must never initiate cycles  
> 
> OK sounds good, I’ll create a mapping of the device BARs in the IOVA.

I don't think this is correct.  Look for instance at ACPI _TRA support
where a system can specify a translation offset such that, for example,
a CPU access to a device is required to add the provided offset to the
bus address of the device.  A system using this could have multiple
root bridges, where each is given the same, overlapping MMIO aperture.
From the processor perspective, each MMIO range is unique and possibly
none of those devices have a zero _TRA, there could be system memory at
the equivalent flat memory address.

So if the transaction actually hits this bus, which I think is what
making use of the device AddressSpace implies, I don't think it can
assume that it's simply reflected back at itself.  Conventional PCI and
PCI Express may be software compatible, but there's a reason we don't
see IOMMUs that provide both translation and isolation in conventional
topologies.

Is this more a bug in the LSI device emulation model?  For instance in
vfio-pci, if I want to access an offset into a BAR from within QEMU, I
don't care what address is programmed into that BAR, I perform an
access relative to the vfio file descriptor region representing that
BAR space.  I'd expect that any viable device emulation model does the
same, an access to device memory uses an offset from an internal
resource, irrespective of the BAR address.

It would seem strange if the driver is actually programming the device
to DMA to itself and if that's actually happening, I'd wonder if this
driver is actually compatible with an IOMMU on bare metal.

> > within their own BAR range, and PCIe is software-compatible with PCI. So
> > devices won't be able to access this IOVA even if it was programmed in
> > the IOMMU.
> > 
> > As was mentioned elsewhere on this thread, devices accessing each
> > other's BAR is a different matter.
> > 
> > I do not remember which rules apply to multiple functions of a
> > multi-function device though. I think in a traditional PCI
> > they will never go out on the bus, but with e.g. SRIOV they
> > would probably do go out? Alex, any idea?

This falls under implementation specific behavior in the spec, IIRC.
This is actually why IOMMU grouping requires ACS support on
multi-function devices to clarify the behavior of p2p between

Re: [PATCH 7/9] user: Declare target-specific prototypes in 'user/cpu-target.h'

2022-02-10 Thread Richard Henderson

On 2/10/22 10:00, Philippe Mathieu-Daudé wrote:

Move user-mode specific prototypes from "exec/exec-all.h"
to "user/cpu-target.h".

Signed-off-by: Philippe Mathieu-Daudé
---


Why a new cpu-target.h, and what is it supposed to mean?  What else is going in there?  It 
all looks cpu_loop related so far.


Why is this separate from the next patch, with "cpu-common.h", which also appears to be 
basically cpu_loop related?



r~



Re: [PATCH 5/9] linux-user/cpu_loop: Add missing 'exec/cpu-all.h' header

2022-02-10 Thread Richard Henderson

On 2/10/22 10:00, Philippe Mathieu-Daudé wrote:

env_cpu() is declared in "exec/cpu-all.h".

Signed-off-by: Philippe Mathieu-Daudé
---
  linux-user/cpu_loop-common.h | 1 +
  1 file changed, 1 insertion(+)


Reviewed-by: Richard Henderson 

r~



Re: [PATCH 6/9] exec: Define MMUAccessType in 'exec/cpu-tlb.h' header

2022-02-10 Thread Richard Henderson

On 2/10/22 10:00, Philippe Mathieu-Daudé wrote:

To reduce the inclusion of "hw/core/cpu.h", extract
MMUAccessType to its own "exec/cpu-tlb.h" header.

Signed-off-by: Philippe Mathieu-Daudé
---


Not keen on the name, unless you plan to put something else in there.


r~



Re: [PATCH 4/9] linux-user/exit: Add missing 'qemu/plugin.h' header

2022-02-10 Thread Richard Henderson

On 2/10/22 10:00, Philippe Mathieu-Daudé wrote:

qemu_plugin_user_exit() is declared in "qemu/plugin.h".

Signed-off-by: Philippe Mathieu-Daudé
---
  linux-user/exit.c | 1 +
  1 file changed, 1 insertion(+)


Reviewed-by: Richard Henderson 

r~



Re: [PATCH 3/9] include: Move exec/user/ to user/

2022-02-10 Thread Richard Henderson

On 2/10/22 10:00, Philippe Mathieu-Daudé wrote:

Avoid spreading the headers in multiple directories,
unify exec/user/ and user/.

Signed-off-by: Philippe Mathieu-Daudé
---
  bsd-user/qemu.h | 4 ++--
  include/exec/cpu-all.h  | 2 +-
  include/{exec => }/user/abitypes.h  | 0
  include/user/safe-syscall.h | 6 +++---
  include/{exec => }/user/thunk.h | 2 +-
  linux-user/qemu.h   | 2 +-
  linux-user/thunk.c  | 2 +-
  linux-user/user-internals.h | 2 +-
  scripts/coverity-scan/COMPONENTS.md | 2 +-
  9 files changed, 11 insertions(+), 11 deletions(-)
  rename include/{exec => }/user/abitypes.h (100%)
  rename include/{exec => }/user/thunk.h (99%)


Reviewed-by: Richard Henderson 

Something I noticed in passing: abitypes.h doesn't need all of cpu.h, only 
cpu-param.h.


r~



Re: [PATCH 2/9] coverity-scan: Cover common-user/

2022-02-10 Thread Richard Henderson

On 2/10/22 10:00, Philippe Mathieu-Daudé wrote:

common-user/ has been added in commit bbf15aaf7c
("common-user: Move safe-syscall.* from linux-user").

Signed-off-by: Philippe Mathieu-Daudé
---
  scripts/coverity-scan/COMPONENTS.md | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)


Reviewed-by: Richard Henderson 

r~



Re: [RFC PATCH 0/3] tests/tcg/ppc64le: fix the build of TCG tests with Clang

2022-02-10 Thread Richard Henderson

On 2/9/22 07:31, matheus.fe...@eldorado.org.br wrote:

The second patch addresses differences in the output of float_madds.c.
The __builtin_fmaf used in this test emits fmadds with GCC and xsmaddasp
with LLVM. The first insn had rounding errors fixed in
d04ca895dc7f ("target/ppc: Add helpers for fmadds et al"); we apply
a similar fix to xsmaddasp.


Thanks for this.  I missed those before.

There are a number of other missed vector cases, which you may have seen.
Basically, anything with "r2sp" needs updating.


r~



Re: [RFC PATCH 2/3] target/ppc: change xs[n]madd[am]sp to use float64r32_muladd

2022-02-10 Thread Richard Henderson

On 2/9/22 07:31, matheus.fe...@eldorado.org.br wrote:

From: Matheus Ferst

Change VSX Scalar Multiply-Add/Subtract Type-A/M Single Precision
helpers to use float64r32_muladd. This method should correctly handle
all rounding modes, so the workaround for float_round_nearest_even can
be dropped.

Signed-off-by: Matheus Ferst
---
  target/ppc/fpu_helper.c | 54 +++--
  1 file changed, 19 insertions(+), 35 deletions(-)


Reviewed-by: Richard Henderson 

r~



Re: [PATCH] hw/arm/armv7m: Handle disconnected clock inputs

2022-02-10 Thread Richard Henderson

On 2/9/22 04:16, Peter Maydell wrote:

In the armv7m object, handle clock inputs that aren't connected.
This is always an error for 'cpuclk'. For 'refclk' it is OK for this
to be disconnected, but we need to handle it by not trying to connect
a sourceless-clock to the systick device.

This fixes a bug where on the mps2-an521 and similar boards (which
do not have a refclk) the systick device incorrectly reset with
SYST_CSR.CLKSOURCE 0 ("use refclk") rather than 1 ("use CPU clock").

Cc:qemu-sta...@nongnu.org
Reported-by: Richard Petri
Signed-off-by: Peter Maydell
---
The other option would be to have clock_has_source() look not
just at clk->source but somehow walk up the clock tree to see
if it can find something that looks like a "root". That seems
overcomplicated...
---
  hw/arm/armv7m.c | 26 ++
  1 file changed, 22 insertions(+), 4 deletions(-)
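
A minimal sketch of the guard this adds, assuming the existing clock_has_source() and qdev_connect_clock_in() helpers (not the patch verbatim):

    static void wire_refclk(ARMv7MState *s, DeviceState *systick)
    {
        if (clock_has_source(s->refclk)) {
            qdev_connect_clock_in(systick, "refclk", s->refclk);
        }
        /* Otherwise leave REFCLK unconnected; systick then resets with
         * SYST_CSR.CLKSOURCE = 1, i.e. "use CPU clock". */
    }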


Reviewed-by: Richard Henderson 

r~



Re: [PATCH v5 03/18] pci: isolated address space for PCI bus

2022-02-10 Thread Michael S. Tsirkin
On Thu, Feb 10, 2022 at 10:23:01PM +, Jag Raman wrote:
> 
> 
> > On Feb 10, 2022, at 3:02 AM, Michael S. Tsirkin  wrote:
> > 
> > On Thu, Feb 10, 2022 at 12:08:27AM +, Jag Raman wrote:
> >> 
> >> 
> >>> On Feb 2, 2022, at 12:34 AM, Alex Williamson  
> >>> wrote:
> >>> 
> >>> On Wed, 2 Feb 2022 01:13:22 +
> >>> Jag Raman  wrote:
> >>> 
> > On Feb 1, 2022, at 5:47 PM, Alex Williamson 
> >  wrote:
> > 
> > On Tue, 1 Feb 2022 21:24:08 +
> > Jag Raman  wrote:
> > 
> >>> On Feb 1, 2022, at 10:24 AM, Alex Williamson 
> >>>  wrote:
> >>> 
> >>> On Tue, 1 Feb 2022 09:30:35 +
> >>> Stefan Hajnoczi  wrote:
> >>> 
>  On Mon, Jan 31, 2022 at 09:16:23AM -0700, Alex Williamson wrote:
> > On Fri, 28 Jan 2022 09:18:08 +
> > Stefan Hajnoczi  wrote:
> > 
> >> On Thu, Jan 27, 2022 at 02:22:53PM -0700, Alex Williamson wrote:   
> >>
> >>> If the goal here is to restrict DMA between devices, ie. 
> >>> peer-to-peer
> >>> (p2p), why are we trying to re-invent what an IOMMU already does? 
> >>>
> >> 
> >> The issue Dave raised is that vfio-user servers run in separate
> >> processes from QEMU with shared memory access to RAM but no direct
> >> access to non-RAM MemoryRegions. The virtiofs DAX Window BAR is one
> >> example of a non-RAM MemoryRegion that can be the source/target of 
> >> DMA
> >> requests.
> >> 
> >> I don't think IOMMUs solve this problem but luckily the vfio-user
> >> protocol already has messages that vfio-user servers can use as a
> >> fallback when DMA cannot be completed through the shared memory RAM
> >> accesses.
> >> 
> >>> In
> >>> fact, it seems like an IOMMU does this better in providing an IOVA
> >>> address space per BDF.  Is the dynamic mapping overhead too much? 
> >>>  What
> >>> physical hardware properties or specifications could we leverage 
> >>> to
> >>> restrict p2p mappings to a device?  Should it be governed by 
> >>> machine
> >>> type to provide consistency between devices?  Should each 
> >>> "isolated"
> >>> bus be in a separate root complex?  Thanks,
> >> 
> >> There is a separate issue in this patch series regarding isolating 
> >> the
> >> address space where BAR accesses are made (i.e. the global
> >> address_space_memory/io). When one process hosts multiple vfio-user
> >> server instances (e.g. a software-defined network switch with 
> >> multiple
> >> ethernet devices) then each instance needs isolated memory and io 
> >> address
> >> spaces so that vfio-user clients don't cause collisions when they 
> >> map
> >> BARs to the same address.
> >> 
> >> I think the separate root complex idea is a good solution. This
> >> patch series takes a different approach by adding the concept of
> >> isolated address spaces into hw/pci/.  
> > 
> > This all still seems pretty sketchy, BARs cannot overlap within the
> > same vCPU address space, perhaps with the exception of when they're
> > being sized, but DMA should be disabled during sizing.
> > 
> > Devices within the same VM context with identical BARs would need to
> > operate in different address spaces.  For example a translation 
> > offset
> > in the vCPU address space would allow unique addressing to the 
> > devices,
> > perhaps using the translation offset bits to address a root complex 
> > and
> > masking those bits for downstream transactions.
> > 
> > In general, the device simply operates in an address space, ie. an
> > IOVA.  When a mapping is made within that address space, we perform 
> > a
> > translation as necessary to generate a guest physical address.  The
> > IOVA itself is only meaningful within the context of the address 
> > space,
> > there is no requirement or expectation for it to be globally unique.
> > 
> > If the vfio-user server is making some sort of requirement that 
> > IOVAs
> > are unique across all devices, that seems very, very wrong.  
> > Thanks,  
>  
>  Yes, BARs and IOVAs don't need to be unique across all devices.
>  
>  The issue is that there can be as many guest physical address spaces 
>  as
>  there are vfio-user clients connected, so per-client isolated address
>  spaces are required. This patch series has a solution to that problem
>  with the new pci_isol_as_mem/io() API.
> >>> 
> >>> Sorry, this still doesn't follow for me.

Re: [PATCH v5 03/18] pci: isolated address space for PCI bus

2022-02-10 Thread Jag Raman


> On Feb 10, 2022, at 3:02 AM, Michael S. Tsirkin  wrote:
> 
> On Thu, Feb 10, 2022 at 12:08:27AM +, Jag Raman wrote:
>> 
>> 
>>> On Feb 2, 2022, at 12:34 AM, Alex Williamson  
>>> wrote:
>>> 
>>> On Wed, 2 Feb 2022 01:13:22 +
>>> Jag Raman  wrote:
>>> 
> On Feb 1, 2022, at 5:47 PM, Alex Williamson  
> wrote:
> 
> On Tue, 1 Feb 2022 21:24:08 +
> Jag Raman  wrote:
> 
>>> On Feb 1, 2022, at 10:24 AM, Alex Williamson 
>>>  wrote:
>>> 
>>> On Tue, 1 Feb 2022 09:30:35 +
>>> Stefan Hajnoczi  wrote:
>>> 
 On Mon, Jan 31, 2022 at 09:16:23AM -0700, Alex Williamson wrote:
> On Fri, 28 Jan 2022 09:18:08 +
> Stefan Hajnoczi  wrote:
> 
>> On Thu, Jan 27, 2022 at 02:22:53PM -0700, Alex Williamson wrote: 
>>  
>>> If the goal here is to restrict DMA between devices, ie. 
>>> peer-to-peer
>>> (p2p), why are we trying to re-invent what an IOMMU already does?   
>>>  
>> 
>> The issue Dave raised is that vfio-user servers run in separate
>> processes from QEMU with shared memory access to RAM but no direct
>> access to non-RAM MemoryRegions. The virtiofs DAX Window BAR is one
>> example of a non-RAM MemoryRegion that can be the source/target of 
>> DMA
>> requests.
>> 
>> I don't think IOMMUs solve this problem but luckily the vfio-user
>> protocol already has messages that vfio-user servers can use as a
>> fallback when DMA cannot be completed through the shared memory RAM
>> accesses.
>> 
>>> In
>>> fact, it seems like an IOMMU does this better in providing an IOVA
>>> address space per BDF.  Is the dynamic mapping overhead too much?  
>>> What
>>> physical hardware properties or specifications could we leverage to
>>> restrict p2p mappings to a device?  Should it be governed by machine
>>> type to provide consistency between devices?  Should each "isolated"
>>> bus be in a separate root complex?  Thanks,
>> 
>> There is a separate issue in this patch series regarding isolating 
>> the
>> address space where BAR accesses are made (i.e. the global
>> address_space_memory/io). When one process hosts multiple vfio-user
>> server instances (e.g. a software-defined network switch with 
>> multiple
>> ethernet devices) then each instance needs isolated memory and io 
>> address
>> spaces so that vfio-user clients don't cause collisions when they map
>> BARs to the same address.
>> 
>> I think the separate root complex idea is a good solution. This
>> patch series takes a different approach by adding the concept of
>> isolated address spaces into hw/pci/.  
> 
> This all still seems pretty sketchy, BARs cannot overlap within the
> same vCPU address space, perhaps with the exception of when they're
> being sized, but DMA should be disabled during sizing.
> 
> Devices within the same VM context with identical BARs would need to
> operate in different address spaces.  For example a translation offset
> in the vCPU address space would allow unique addressing to the 
> devices,
> perhaps using the translation offset bits to address a root complex 
> and
> masking those bits for downstream transactions.
> 
> In general, the device simply operates in an address space, ie. an
> IOVA.  When a mapping is made within that address space, we perform a
> translation as necessary to generate a guest physical address.  The
> IOVA itself is only meaningful within the context of the address 
> space,
> there is no requirement or expectation for it to be globally unique.
> 
> If the vfio-user server is making some sort of requirement that IOVAs
> are unique across all devices, that seems very, very wrong.  Thanks,  
> 
 
 Yes, BARs and IOVAs don't need to be unique across all devices.
 
 The issue is that there can be as many guest physical address spaces as
 there are vfio-user clients connected, so per-client isolated address
 spaces are required. This patch series has a solution to that problem
 with the new pci_isol_as_mem/io() API.
>>> 
>>> Sorry, this still doesn't follow for me.  A server that hosts multiple
>>> devices across many VMs (I'm not sure if you're referring to the device
>>> or the VM as a client) needs to deal with different address spaces per
>>> device.  The server needs to be able to uniquely identify every DMA,
>>> which must be part of the interface protocol.  But I don't see how that
>>> impos

Re: [PATCH v3] target/riscv: Enable Zicbo[m,z,p] instructions

2022-02-10 Thread Richard Henderson

On 2/11/22 03:48, Philipp Tomsich wrote:

-lq   ............ ..... 010 ..... 0001111 @i
+{
+  [
+    # *** RV32 Zicbom Standard Extension ***
+    cbo_clean  0000000 00001 ..... 010 00000 0001111 @sfence_vm
+    cbo_flush  0000000 00010 ..... 010 00000 0001111 @sfence_vm
+    cbo_inval  0000000 00000 ..... 010 00000 0001111 @sfence_vm
+
+    # *** RV32 Zicboz Standard Extension ***
+    cbo_zero   0000000 00100 ..... 010 00000 0001111 @sfence_vm
+  ]
+
+  # *** RVI128 lq ***
+  lq   ............ ..... 010 ..... 0001111 @i
+}


...

+#define REQUIRE_ZICBOM(ctx) do {               \
+    if (!RISCV_CPU(ctx->cs)->cfg.ext_icbom) {  \
+        return false;                          \
+    }                                          \
+} while (0)


The exception semantics seem to be broken here: if Zicbom is not implemented, but the 
requirements for lq (i.e. rv128) are satisfied, then this needs to be passed on to lq: "lq 
zero, 0(rs1)" is still expected to raise exceptions based on the permissions for the 
address at 0(rs1).


There are multiple ways to do this, including:
1) perform a tail-call to trans_lq, in case Zicbom is not enabled (instead of just 
returning false);
2) use the table-based dispatch (added for XVentanaCondOps) and hook a Zicbom 
dispatcher before the RVI dispatcher: if Zicbom then falls through, the RVI dispatcher 
would drop into trans_lq;


No, returning false will cause the next pattern in the { } group to be matched.  No need 
for other workarounds.
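
A sketch of the resulting trans function (the argument type name is as decodetree would generate it):

    static bool trans_cbo_clean(DisasContext *ctx, arg_cbo_clean *a)
    {
        REQUIRE_ZICBOM(ctx);   /* returns false, so the decoder simply
                                  tries the next pattern in the group (lq) */
        /* ... emit the cache-block clean operation ... */
        return true;
    }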



r~



[PATCH] mps3-an547: Add missing user ahb interfaces

2022-02-10 Thread Jimmy Brisson
With these interfaces missing, TFM would delegate peripherals 0, 1,
2, 3 and 8, and qemu would ignore the delegation of interface 8, as
it thought interface 4 was eth & USB.

This patch corrects this behavior and allows TFM to delegate the
eth & USB peripheral to NS mode.

Signed-off-by: Jimmy Brisson 
---
 hw/arm/mps2-tz.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/hw/arm/mps2-tz.c b/hw/arm/mps2-tz.c
index f40e854dec..e287ad4d06 100644
--- a/hw/arm/mps2-tz.c
+++ b/hw/arm/mps2-tz.c
@@ -1078,6 +1078,10 @@ static void mps2tz_common_init(MachineState *machine)
 { "gpio1", make_unimp_dev, &mms->gpio[1], 0x41101000, 0x1000 },
 { "gpio2", make_unimp_dev, &mms->gpio[2], 0x41102000, 0x1000 },
 { "gpio3", make_unimp_dev, &mms->gpio[3], 0x41103000, 0x1000 },
+{ /* port 4 USER AHB interface 0 */ },
+{ /* port 5 USER AHB interface 1 */ },
+{ /* port 6 USER AHB interface 2 */ },
+{ /* port 7 USER AHB interface 3 */ },
 { "eth-usb", make_eth_usb, NULL, 0x4140, 0x20, { 49 } 
},
 },
 },
-- 
2.33.1



