date:20220126

Re: [PATCH v2 32/40] bsd-user/signal.c: handle_pending_signal

2022-01-26 Thread Richard Henderson


On 1/25/22 12:29, Warner Losh wrote:

Handle a queued signal.

Signed-off-by: Stacey Son
Signed-off-by: Kyle Evans
Signed-off-by: Warner Losh
---
  bsd-user/qemu.h   |  7 
  bsd-user/signal.c | 87 +++
  2 files changed, 94 insertions(+)


Reviewed-by: Richard Henderson 

r~

Re: [PATCH v2 30/40] bsd-user/signal.c: sigset manipulation routines.

2022-01-26 Thread Richard Henderson


On 1/25/22 12:29, Warner Losh wrote:

target_sigemptyset: resets a set to having no bits set
target_sigaddset:   adds a signal to a set
target_sigismember: returns true when signal is a member
host_to_target_sigset_internal: convert host sigset to target
host_to_target_sigset: convert host sigset to target
target_to_host_sigset_internal: convert target sigset to host
target_to_host_sigset: convert target sigset to host

Signed-off-by: Stacey Son
Signed-off-by: Kyle Evans
Signed-off-by: Warner Losh
---
  bsd-user/signal-common.h |  2 ++
  bsd-user/signal.c| 74 
  2 files changed, 76 insertions(+)


Reviewed-by: Richard Henderson 

r~

Re: [PATCH v2 29/40] bsd-user/signal.c: Fill in queue_signal

2022-01-26 Thread Richard Henderson


On 1/25/22 12:29, Warner Losh wrote:

Fill in queue signal implementation, as well as routines allocate and
delete elements of the signal queue.

Signed-off-by: Stacey Son 
Signed-off-by: Kyle Evans 
Signed-off-by: Warner Losh 
---
  bsd-user/qemu.h   |  1 +
  bsd-user/signal.c | 13 -
  2 files changed, 13 insertions(+), 1 deletion(-)


Reviewed-by: Richard Henderson 


+struct emulated_sigtable sync_signal;
  struct emulated_sigtable sigtab[TARGET_NSIG];


I'll note that we don't need an array of these, since we block all signals while returning 
to the main cpu loop, so we can't receive a second async signal.  Something to be fixed 
for both l-user and b-user later...



r~

Re: [PATCH v2 31/40] bsd-user/signal.c: setup_frame

2022-01-26 Thread Richard Henderson


On 1/25/22 12:29, Warner Losh wrote:

setup_frame sets up a signalled stack frame. Associated routines to
extract the pointer to the stack frame and to support alternate stacks.

Signed-off-by: Stacey Son
Signed-off-by: Kyle Evans
Signed-off-by: Warner Losh
---
  bsd-user/main.c   |  5 +++
  bsd-user/qemu.h   |  3 +-
  bsd-user/signal.c | 82 +++
  3 files changed, 89 insertions(+), 1 deletion(-)


Reviewed-by: Richard Henderson 



+#if defined(TARGET_ARM)
+return (sp - frame_size) & ~7;
+#elif defined(TARGET_AARCH64)
+return (sp - frame_size) & ~15;
+#else
+return sp - frame_size;
+#endif


Just double-checking that this is still in the cleanup queue.
I would expect x86 to require 16 byte alignment as well, for sse.


r~

Re: [PATCH v2 19/40] bsd-user/host/arm/host-signal.h: Implement host_signal_*

2022-01-26 Thread Richard Henderson


On 1/25/22 12:29, Warner Losh wrote:

Implement host_signal_pc, host_signal_set_pc and host_signal_write for
arm.

Signed-off-by: Kyle Evans
Signed-off-by: Warner Losh
---
  bsd-user/host/arm/host-signal.h | 35 +
  1 file changed, 35 insertions(+)
  create mode 100644 bsd-user/host/arm/host-signal.h


Reviewed-by: Richard Henderson 

r~

Re: [PATCH v6 3/7] qapi/commands: Optionally generate trace for QMP commands

2022-01-26 Thread Markus Armbruster

Vladimir Sementsov-Ogievskiy  writes:

> Add trace generation disabled by default and new option --gen-trace to
> enable it.  The next commit will enable it for qapi/, but not for qga/
> and tests/.  Making it work for the latter two would involve some Meson
> hackery to ensure we generate the trace-events files before trace-tool
> uses them.  Since we don't actually support tracing there, we'll bypass
> that problem.
>
> Signed-off-by: Vladimir Sementsov-Ogievskiy 
> Reviewed-by: Stefan Hajnoczi 
> ---
>  scripts/qapi/commands.py | 91 +++-
>  scripts/qapi/main.py | 14 +--
>  2 files changed, 91 insertions(+), 14 deletions(-)
>
> diff --git a/scripts/qapi/commands.py b/scripts/qapi/commands.py
> index 17e5ed2414..fa90b6246b 100644
> --- a/scripts/qapi/commands.py
> +++ b/scripts/qapi/commands.py

[...]

> @@ -265,6 +319,17 @@ def _begin_user_module(self, name: str) -> None:
>  
>  ''',
>   commands=commands, visit=visit))
> +
> +if self._gen_tracing and commands != 'qapi-commands':
> +self._genc.add(mcgen('''
> +#include "trace/trace-qapi.h"

I believe this include is superfluous.

> +#include "qapi/qmp/qjson.h"
> +#include "trace/trace-%(nm)s_trace_events.h"
> +''',
> + nm=c_name(commands, protect=False)))
> +# We use c_name(commands, protect=False) to turn '-' into '_', to
> +# match .underscorify() in trace/meson.build
> +
>  self._genh.add(mcgen('''
>  #include "%(types)s.h"
>  

[...]

Re: [PATCH v2 26/40] bsd-user/signal.c: Implement host_signal_handler

2022-01-26 Thread Richard Henderson


On 1/25/22 12:29, Warner Losh wrote:

Implement host_signal_handler to handle signals generated by the host
and to do safe system calls.

Signed-off-by: Stacey Son
Signed-off-by: Kyle Evans
Signed-off-by: Warner Losh
---
  bsd-user/signal.c | 105 ++
  1 file changed, 105 insertions(+)


Reviewed-by: Richard Henderson 

r~

Re: [PATCH v6 0/7] trace qmp commands

2022-01-26 Thread Markus Armbruster

Vladimir Sementsov-Ogievskiy  writes:

> This series aims to add trace points for each qmp command with help of
> qapi code generator.

Logging QMP traffic has worked well enough for me, but I can understand
why you'd like to use tracing.  Other uses of tracing might come up in
the future.

I found a few last nits to pick.  Happy to address them in my tree.

Except for PATCH 4
Reviewed-by: Markus Armbruster

[PATCH 3/4] target/arm: Fix {fp, sve}_exception_el for VHE mode running

2022-01-26 Thread Richard Henderson

When HCR_EL2.E2H is set, the format of CPTR_EL2 changes to
look more like CPACR_EL1, with ZEN and FPEN fields instead
of TZ and TFP fields.

Reported-by: Zenghui Yu 
Signed-off-by: Richard Henderson 
---
 target/arm/helper.c | 77 +++--
 1 file changed, 60 insertions(+), 17 deletions(-)

diff --git a/target/arm/helper.c b/target/arm/helper.c
index cd48560786..ca916139e8 100644
--- a/target/arm/helper.c
+++ b/target/arm/helper.c
@@ -6180,15 +6180,41 @@ int sve_exception_el(CPUARMState *env, int el)
 }
 }
 
-/* CPTR_EL2.  Since TZ and TFP are positive,
- * they will be zero when EL2 is not present.
+/*
+ * CPTR_EL2 changes format with HCR_EL2.E2H (regardless of TGE).
  */
-if (el <= 2 && arm_is_el2_enabled(env)) {
-if (env->cp15.cptr_el[2] & CPTR_TZ) {
-return 2;
-}
-if (env->cp15.cptr_el[2] & CPTR_TFP) {
-return 0;
+if (el <= 2) {
+if (hcr_el2 & HCR_E2H) {
+/* Check CPTR_EL2.ZEN.  */
+switch (extract32(env->cp15.cptr_el[2], 16, 2)) {
+case 1:
+if (el != 0 || !(hcr_el2 & HCR_TGE)) {
+break;
+}
+/* fall through */
+case 0:
+case 2:
+return 2;
+}
+
+/* Check CPTR_EL2.FPEN.  */
+switch (extract32(env->cp15.cptr_el[2], 20, 2)) {
+case 1:
+if (el == 2 || !(hcr_el2 & HCR_TGE)) {
+break;
+}
+/* fall through */
+case 0:
+case 2:
+return 0;
+}
+} else if (arm_is_el2_enabled(env)) {
+if (env->cp15.cptr_el[2] & CPTR_TZ) {
+return 2;
+}
+if (env->cp15.cptr_el[2] & CPTR_TFP) {
+return 0;
+}
 }
 }
 
@@ -12909,6 +12935,8 @@ uint32_t HELPER(crc32c)(uint32_t acc, uint32_t val, 
uint32_t bytes)
 int fp_exception_el(CPUARMState *env, int cur_el)
 {
 #ifndef CONFIG_USER_ONLY
+uint64_t hcr_el2;
+
 /* CPACR and the CPTR registers don't exist before v6, so FP is
  * always accessible
  */
@@ -12932,13 +12960,15 @@ int fp_exception_el(CPUARMState *env, int cur_el)
 return 0;
 }
 
+hcr_el2 = arm_hcr_el2_eff(env);
+
 /* The CPACR controls traps to EL1, or PL1 if we're 32 bit:
  * 0, 2 : trap EL0 and EL1/PL1 accesses
  * 1: trap only EL0 accesses
  * 3: trap no accesses
  * This register is ignored if E2H+TGE are both set.
  */
-if ((arm_hcr_el2_eff(env) & (HCR_E2H | HCR_TGE)) != (HCR_E2H | HCR_TGE)) {
+if ((hcr_el2 & (HCR_E2H | HCR_TGE)) != (HCR_E2H | HCR_TGE)) {
 int fpen = extract32(env->cp15.cpacr_el1, 20, 2);
 
 switch (fpen) {
@@ -12979,15 +13009,28 @@ int fp_exception_el(CPUARMState *env, int cur_el)
 }
 }
 
-/* For the CPTR registers we don't need to guard with an ARM_FEATURE
- * check because zero bits in the registers mean "don't trap".
+/*
+ * CPTR_EL2 is present in v7VE or v8, and changes format
+ * with HCR_EL2.E2H (regardless of TGE).
  */
-
-/* CPTR_EL2 : present in v7VE or v8 */
-if (cur_el <= 2 && extract32(env->cp15.cptr_el[2], 10, 1)
-&& arm_is_el2_enabled(env)) {
-/* Trap FP ops at EL2, NS-EL1 or NS-EL0 to EL2 */
-return 2;
+if (cur_el <= 2) {
+if (hcr_el2 & HCR_E2H) {
+/* Check CPTR_EL2.FPEN.  */
+switch (extract32(env->cp15.cptr_el[2], 20, 2)) {
+case 1:
+if (cur_el != 0 || !(hcr_el2 & HCR_TGE)) {
+break;
+}
+/* fall through */
+case 0:
+case 2:
+return 2;
+}
+} else if (arm_is_el2_enabled(env)) {
+if (env->cp15.cptr_el[2] & CPTR_TFP) {
+return 2;
+}
+}
 }
 
 /* CPTR_EL3 : present in v8 */
-- 
2.25.1

[PATCH 4/4] target/arm: Use CPTR_TFP with CPTR_EL3 in fp_exception_el

2022-01-26 Thread Richard Henderson

Use the named bit rather than a bare extract32.

Signed-off-by: Richard Henderson 
---
 target/arm/helper.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/target/arm/helper.c b/target/arm/helper.c
index ca916139e8..5610879680 100644
--- a/target/arm/helper.c
+++ b/target/arm/helper.c
@@ -13034,7 +13034,7 @@ int fp_exception_el(CPUARMState *env, int cur_el)
 }
 
 /* CPTR_EL3 : present in v8 */
-if (extract32(env->cp15.cptr_el[3], 10, 1)) {
+if (env->cp15.cptr_el[3] & CPTR_TFP) {
 /* Trap all FP ops to EL3 */
 return 3;
 }
-- 
2.25.1

Re: [PATCH v6 6/7] meson: document, why we don't generate trace events for tests/ and qga/

2022-01-26 Thread Markus Armbruster

Nitpick: drop the comma in the title.

[PATCH 2/4] target/arm: Tidy sve_exception_el for CPACR_EL1 access

2022-01-26 Thread Richard Henderson

Extract entire fields for ZEN and FPEN, rather than testing specific bits.
This makes it easier to follow the code versus the ARM spec.

Signed-off-by: Richard Henderson 
---
 target/arm/helper.c | 36 +---
 1 file changed, 17 insertions(+), 19 deletions(-)

diff --git a/target/arm/helper.c b/target/arm/helper.c
index d9ee2ba5f4..cd48560786 100644
--- a/target/arm/helper.c
+++ b/target/arm/helper.c
@@ -6154,30 +6154,28 @@ int sve_exception_el(CPUARMState *env, int el)
 uint64_t hcr_el2 = arm_hcr_el2_eff(env);
 
 if (el <= 1 && (hcr_el2 & (HCR_E2H | HCR_TGE)) != (HCR_E2H | HCR_TGE)) {
-bool disabled = false;
-
-/* The CPACR.ZEN controls traps to EL1:
- * 0, 2 : trap EL0 and EL1 accesses
- * 1: trap only EL0 accesses
- * 3: trap no accesses
- */
-if (!extract32(env->cp15.cpacr_el1, 16, 1)) {
-disabled = true;
-} else if (!extract32(env->cp15.cpacr_el1, 17, 1)) {
-disabled = el == 0;
-}
-if (disabled) {
+/* Check CPACR.ZEN.  */
+switch (extract32(env->cp15.cpacr_el1, 16, 2)) {
+case 1:
+if (el != 0) {
+break;
+}
+/* fall through */
+case 0:
+case 2:
 /* route_to_el2 */
 return hcr_el2 & HCR_TGE ? 2 : 1;
 }
 
 /* Check CPACR.FPEN.  */
-if (!extract32(env->cp15.cpacr_el1, 20, 1)) {
-disabled = true;
-} else if (!extract32(env->cp15.cpacr_el1, 21, 1)) {
-disabled = el == 0;
-}
-if (disabled) {
+switch (extract32(env->cp15.cpacr_el1, 20, 2)) {
+case 1:
+if (el != 0) {
+break;
+}
+/* fall through */
+case 0:
+case 2:
 return 0;
 }
 }
-- 
2.25.1

Re: [PATCH v2 18/40] bsd-user/signal.c: Add si_type argument to queue_signal

2022-01-26 Thread Richard Henderson


On 1/25/22 12:29, Warner Losh wrote:

Mirror the linux-user practice and add a si_type argument to queue
signal. This will be transported as the upper 8 bits in the si_type
element of siginfo so that we know what bits of the structure are valid
and so we can properly implement host_to_target_siginfo_noswap and
tswap_siginfo. Adapt the one caller of queue_signal to the new
interface.  Use all the same names as Linux (except _RT which we don't
treat differently, unlike Linux), though some are unused. Place this
into signal-common.h since that's a better place given bsd-user's
structure. Move prototype of queue_signal to signal-common.h to mirror
linux-user's location.

Signed-off-by: Warner Losh
---
  bsd-user/signal-common.h | 26 +-
  bsd-user/signal.c|  5 +++--
  2 files changed, 28 insertions(+), 3 deletions(-)


Reviewed-by: Richard Henderson 

r~

[PATCH 0/4] target/arm: SVE fixes versus VHE

2022-01-26 Thread Richard Henderson

Fix two problems described in

https://lore.kernel.org/qemu-devel/6cdfd5de-2465-adca-73b3-9c66945cf...@huawei.com/

with some other minor code cleanup.


r~


Richard Henderson (4):
  target/arm: Fix sve_zcr_len_for_el for VHE mode running
  target/arm: Tidy sve_exception_el for CPACR_EL1 access
  target/arm: Fix {fp,sve}_exception_el for VHE mode running
  target/arm: Use CPTR_TFP with CPTR_EL3 in fp_exception_el

 target/arm/helper.c | 118 ++--
 1 file changed, 80 insertions(+), 38 deletions(-)

-- 
2.25.1

[PATCH 1/4] target/arm: Fix sve_zcr_len_for_el for VHE mode running

2022-01-26 Thread Richard Henderson

When HCR_EL2.{E2H,TGE} == '11', ZCR_EL1 is unused.

Reported-by: Zenghui Yu 
Signed-off-by: Richard Henderson 
---
 target/arm/helper.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/target/arm/helper.c b/target/arm/helper.c
index cfca0f5ba6..d9ee2ba5f4 100644
--- a/target/arm/helper.c
+++ b/target/arm/helper.c
@@ -6225,7 +6225,8 @@ uint32_t sve_zcr_len_for_el(CPUARMState *env, int el)
 ARMCPU *cpu = env_archcpu(env);
 uint32_t zcr_len = cpu->sve_max_vq - 1;
 
-if (el <= 1) {
+if (el <= 1 &&
+(arm_hcr_el2_eff(env) & (HCR_E2H | HCR_TGE)) != (HCR_E2H | HCR_TGE)) {
 zcr_len = MIN(zcr_len, 0xf & (uint32_t)env->vfp.zcr_el[1]);
 }
 if (el <= 2 && arm_feature(env, ARM_FEATURE_EL2)) {
-- 
2.25.1

Re: [PATCH v2 16/40] bsd-user/signal.c: implement abstract target / host signal translation

2022-01-26 Thread Richard Henderson


On 1/25/22 12:29, Warner Losh wrote:

Implement host_to_target_signal and target_to_host_signal.

Signed-off-by: Stacey Son
Signed-off-by: Kyle Evans
Signed-off-by: Warner Losh
Reviewed-by: Peter Maydell
---
  bsd-user/signal-common.h |  2 ++
  bsd-user/signal.c| 16 
  2 files changed, 18 insertions(+)


Reviewed-by: Richard Henderson 

r~

Re: [PATCH v2 14/40] bsd-user/arm/target_arch_cpu.h: Use force_sig_fault for EXCP_UDEF

2022-01-26 Thread Richard Henderson


On 1/25/22 12:29, Warner Losh wrote:

+case EXCP_NOCP:
+case EXCP_INVSTATE:
+/*
+ * See arm/arm/undefined.c undefinedinstruction();
+ *
+ * A number of details aren't emulated (they likely don't matter):
+ * o Misaligned PC generates ILL_ILLADR


As I mentioned, misaligned pc will not come here for qemu.
In the Arm ARM, see aarch32/functions/registers/BXWritePC:

// For branches to an unaligned PC counter in A32 state, the processor takes 
the branch
// and does one of:
// * Forces the address to be aligned
// * Leaves the PC unaligned, meaning the target generates a PC Alignment fault.

The hardware will either refuse to allow bit 1 to be set when bit 0 is clear, OR it will 
generate a PREFETCH_DATA_ABORT for Alignment.


QEMU will do the latter.


Otherwise,
Reviewed-by: Richard Henderson 


r~

Re: [PATCH v2 09/40] bsd-user/signal-common.h: Move signal functions prototypes to here

2022-01-26 Thread Richard Henderson


On 1/25/22 12:29, Warner Losh wrote:

Signed-off-by: Warner Losh
---
  bsd-user/qemu.h  | 8 
  bsd-user/signal-common.h | 6 ++
  2 files changed, 6 insertions(+), 8 deletions(-)


Reviewed-by: Richard Henderson 

r~

Re: [PATCH v2 06/40] bsd-user: Bring in docs from linux-user for signal_pending

2022-01-26 Thread Richard Henderson


On 1/25/22 12:29, Warner Losh wrote:

This is currently unused, so no code adjustments are needed.

Signed-off-by: Warner Losh
---
  bsd-user/qemu.h | 9 -
  1 file changed, 8 insertions(+), 1 deletion(-)


Reviewed-by: Richard Henderson 

r~

Re: [PATCH v2 05/40] bsd-user: Remove vestiges of signal queueing code

2022-01-26 Thread Richard Henderson


On 1/25/22 12:29, Warner Losh wrote:

bsd-user was copied from linux-user at a time when it queued
signals. Remove those vestiges of thse code. Retain the init function,
even though it's now empty since other stuff will likely be added
there. Make it static since it's not called from outside of main.c

Signed-off-by: Warner Losh
---
  bsd-user/main.c |  9 +
  bsd-user/qemu.h | 13 +
  2 files changed, 2 insertions(+), 20 deletions(-)


Reviewed-by: Richard Henderson 

r~

Re: [PATCH v2 04/40] bsd-user/arm/signal.c: get_mcontext should zero vfp data

2022-01-26 Thread Richard Henderson


On 1/25/22 12:29, Warner Losh wrote:

FreeBSD's get_mcontext doesn't return any vfp data. Instead, it zeros
out the vfp feilds (and all the spare fields). Impelement this
behavior. We're still missing the sysarch(ARM_GET_VFPCONTEXT) syscall,
though.

Signed-off-by: Warner Losh 
---
  bsd-user/arm/signal.c | 8 
  1 file changed, 8 insertions(+)

diff --git a/bsd-user/arm/signal.c b/bsd-user/arm/signal.c
index 9026343b478..6eadc6e3c56 100644
--- a/bsd-user/arm/signal.c
+++ b/bsd-user/arm/signal.c
@@ -109,6 +109,14 @@ abi_long get_mcontext(CPUARMState *env, target_mcontext_t 
*mcp, int flags)
  gr[TARGET_REG_LR] = tswap32(env->regs[14]);
  gr[TARGET_REG_PC] = tswap32(env->regs[15]);
  
+/*

+ * FreeBSD's set_mcontext doesn't save VFP info, but blanks it out instead.
+ * Instead, sysarch(ARM_GET_VFPSTATE) is used instead.
+ */


Could be rewritten with fewer "instead".  You wanted get_mcontext.
Otherwise,

Reviewed-by: Richard Henderson 


r~

Re: [PATCH v2 03/40] bsd-user/arm/signal.c: Implement setup_sigframe_arch for arm

2022-01-26 Thread Richard Henderson


On 1/25/22 12:29, Warner Losh wrote:

Fix the broken context setting for arm. FreeBSD's get_mcontext does not
fill in the vfp info. It's filled in in sigframe(). This corresponds to
the new setup_sigframe_arch which fills in mcontext, then adjusts it to
point to the vfp context in the sigframe and fills in that context as
well. Add pointer to where this code is done.

Signed-off-by: Warner Losh 
---
  bsd-user/arm/signal.c   | 50 +++--
  bsd-user/freebsd/target_os_signal.h |  2 +-
  2 files changed, 34 insertions(+), 18 deletions(-)


Reviewed-by: Richard Henderson 


diff --git a/bsd-user/freebsd/target_os_signal.h 
b/bsd-user/freebsd/target_os_signal.h
index 7491629477a..43700d08f71 100644
--- a/bsd-user/freebsd/target_os_signal.h
+++ b/bsd-user/freebsd/target_os_signal.h
@@ -4,7 +4,7 @@
  #include "target_os_siginfo.h"
  #include "target_arch_signal.h"
  
-abi_long setup_sigframe_arch(CPUArchState *regs, abi_ulong frame_addr,

+abi_long setup_sigframe_arch(CPUArchState *env, abi_ulong frame_addr,
   struct target_sigframe *frame, int flags);
  
  /* Compare to sys/signal.h */


Should be in patch 2.


r~

Re: [PATCH v2 02/40] bsd-user: Create setup_sigframe_arch to setup sigframe context

2022-01-26 Thread Richard Henderson


On 1/25/22 12:29, Warner Losh wrote:

Define setup_sigframe_arch whose job it is to setup the mcontext for the
sigframe. Implement for x86 to just call mcontext.

Signed-off-by: Warner Losh
---
  bsd-user/freebsd/target_os_signal.h |  3 +++
  bsd-user/i386/signal.c  | 13 +
  bsd-user/x86_64/signal.c| 13 +
  3 files changed, 29 insertions(+)


Reviewed-by: Richard Henderson 


r~

Re: [PATCH v2 01/40] bsd-user: Complete FreeBSD siginfo

2022-01-26 Thread Richard Henderson


On 1/25/22 12:29, Warner Losh wrote:

Fill in the missing FreeBSD siginfo fields, and add some comments.

Signed-off-by: Warner Losh
---
  bsd-user/freebsd/target_os_siginfo.h | 15 ++-
  1 file changed, 14 insertions(+), 1 deletion(-)


Reviewed-by: Richard Henderson 


r~

[PATCH V2 3/3] migration/migration.c: Remove the MIGRATION_STATUS_ACTIVE when migration finished

2022-01-26 Thread Zhang Chen

The MIGRATION_STATUS_ACTIVE indicates that migration is running.
Remove it to be handled by the default operation,
It should be part of the unknown ending states.

Signed-off-by: Zhang Chen 
Reviewed-by: Juan Quintela 

---
 migration/migration.c | 6 --
 1 file changed, 6 deletions(-)

diff --git a/migration/migration.c b/migration/migration.c
index 3fac9c67ca..21e1498f46 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -3625,12 +3625,6 @@ static void migration_iteration_finish(MigrationState *s)
  "COLO enabled", __func__);
 }
 migrate_start_colo_process(s);
- /* Fallthrough */
-case MIGRATION_STATUS_ACTIVE:
-/*
- * We should really assert here, but since it's during
- * migration, let's try to reduce the usage of assertions.
- */
 s->vm_was_running = true;
 /* Fallthrough */
 case MIGRATION_STATUS_FAILED:
-- 
2.25.1

[PATCH V2 1/3] migration/migration.c: Add missed default error handler for migration state

2022-01-26 Thread Zhang Chen

In the migration_completion() no other status is expected, for
example MIGRATION_STATUS_CANCELLING, MIGRATION_STATUS_CANCELLED, etc.

Signed-off-by: Zhang Chen 
Reviewed-by: Juan Quintela 
Reviewed-by: Dr. David Alan Gilbert 

---
 migration/migration.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/migration/migration.c b/migration/migration.c
index 0652165610..2afa77da03 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -3205,7 +3205,7 @@ static void migration_completion(MigrationState *s)
 qemu_mutex_unlock_iothread();
 
 trace_migration_completion_postcopy_end_after_complete();
-} else if (s->state == MIGRATION_STATUS_CANCELLING) {
+} else {
 goto fail;
 }
 
-- 
2.25.1

RE: [PATCH 2/3] migration/migration.c: Avoid COLO boot in postcopy migration

2022-01-26 Thread Zhang, Chen




> -Original Message-
> From: Juan Quintela 
> Sent: Thursday, January 27, 2022 3:30 AM
> To: Zhang, Chen 
> Cc: Dr. David Alan Gilbert ; qemu-dev  de...@nongnu.org>
> Subject: Re: [PATCH 2/3] migration/migration.c: Avoid COLO boot in postcopy
> migration
> 
> Zhang Chen  wrote:
> > COLO dose not support postcopy migration and remove the Fixme.
> >
> > Signed-off-by: Zhang Chen 
> 
> Reviewed-by: Juan Quintela 
> 
> fixed the typo.

OK, already updated V2 for this series.

Thanks
Chen

[PATCH V2 2/3] migration/migration.c: Avoid COLO boot in postcopy migration

2022-01-26 Thread Zhang Chen

COLO does not support postcopy migration and remove the Fixme.

Signed-off-by: Zhang Chen 
Reviewed-by: Juan Quintela 

---
 migration/migration.c | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/migration/migration.c b/migration/migration.c
index 2afa77da03..3fac9c67ca 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -3230,7 +3230,11 @@ static void migration_completion(MigrationState *s)
 goto fail_invalidate;
 }
 
-if (!migrate_colo_enabled()) {
+if (migrate_colo_enabled() && s->state == MIGRATION_STATUS_ACTIVE) {
+/* COLO dose not support postcopy */
+migrate_set_state(>state, MIGRATION_STATUS_ACTIVE,
+  MIGRATION_STATUS_COLO);
+} else {
 migrate_set_state(>state, current_active_state,
   MIGRATION_STATUS_COMPLETED);
 }
@@ -3621,10 +3625,6 @@ static void migration_iteration_finish(MigrationState *s)
  "COLO enabled", __func__);
 }
 migrate_start_colo_process(s);
-/*
- * Fixme: we will run VM in COLO no matter its old running state.
- * After exited COLO, we will keep running.
- */
  /* Fallthrough */
 case MIGRATION_STATUS_ACTIVE:
 /*
-- 
2.25.1

[PATCH V2 0/3] Some minor fixes for migration states

2022-01-26 Thread Zhang Chen

This series solved some fixme and comments in code.
Please see the details in each patch commit message.

 --V2:
   -Fix typo.


Zhang Chen (3):
  migration/migration.c: Add missed default error handler for migration
state
  migration/migration.c: Avoid COLO boot in postcopy migration
  migration/migration.c: Remove the MIGRATION_STATUS_ACTIVE when
migration finished

 migration/migration.c | 18 ++
 1 file changed, 6 insertions(+), 12 deletions(-)

-- 
2.25.1

Re: [PATCH] hw/sensor: Add SB-TSI Temperature Sensor Interface

2022-01-26 Thread Hao Wu

Hi,

Sorry for the late reply. I'm not sure what "auto-increment" means here.
The kernel driver at
https://lkml.org/lkml/2020/12/11/968
only has byte-size read/write operations, so I don't see a problem here.
The values are extracted
from the datasheet.

On Tue, Jan 18, 2022 at 9:10 AM Patrick Venture  wrote:

>
>
> On Mon, Jan 17, 2022 at 6:05 AM Corey Minyard  wrote:
>
>> On Sun, Jan 09, 2022 at 06:17:34PM -0800, Patrick Venture wrote:
>> > On Fri, Jan 7, 2022 at 7:04 PM Patrick Venture 
>> wrote:
>> >
>> > > From: Hao Wu 
>> > >
>> > > SB Temperature Sensor Interface (SB-TSI) is an SMBus compatible
>> > > interface that reports AMD SoC's Ttcl (normalized temperature),
>> > > and resembles a typical 8-pin remote temperature sensor's I2C
>> interface
>> > > to BMC.
>> > >
>> > > This patch implements a basic AMD SB-TSI sensor that is
>> > > compatible with the open-source data sheet from AMD and Linux
>> > > kernel driver.
>> > >
>> > > Reference:
>> > > Linux kernel driver:
>> > > https://lkml.org/lkml/2020/12/11/968
>> > > Register Map:
>> > > https://developer.amd.com/wp-content/resources/56255_3_03.PDF
>> > > (Chapter 6)
>> > >
>> > > Signed-off-by: Hao Wu 
>> > > Reviewed-by: Doug Evans 
>> > > ---
>> > >  hw/sensor/Kconfig|   4 +
>> > >  hw/sensor/meson.build|   1 +
>> > >  hw/sensor/tmp_sbtsi.c| 393
>> +++
>> > >  hw/sensor/trace-events   |   5 +
>> > >  hw/sensor/trace.h|   1 +
>> > >  meson.build  |   1 +
>> > >  tests/qtest/meson.build  |   1 +
>> > >  tests/qtest/tmp_sbtsi-test.c | 180 
>> > >  8 files changed, 586 insertions(+)
>> > >  create mode 100644 hw/sensor/tmp_sbtsi.c
>> > >  create mode 100644 hw/sensor/trace-events
>> > >  create mode 100644 hw/sensor/trace.h
>> > >  create mode 100644 tests/qtest/tmp_sbtsi-test.c
>> > >
>> > > diff --git a/hw/sensor/Kconfig b/hw/sensor/Kconfig
>> > > index 9c8a049b06..27f6f79c84 100644
>> > > --- a/hw/sensor/Kconfig
>> > > +++ b/hw/sensor/Kconfig
>> > > @@ -21,3 +21,7 @@ config ADM1272
>> > >  config MAX34451
>> > >  bool
>> > >  depends on I2C
>> > > +
>> > > +config AMDSBTSI
>> > > +bool
>> > > +depends on I2C
>> > > diff --git a/hw/sensor/meson.build b/hw/sensor/meson.build
>> > > index 059c4ca935..f7b0e645eb 100644
>> > > --- a/hw/sensor/meson.build
>> > > +++ b/hw/sensor/meson.build
>> > > @@ -4,3 +4,4 @@ softmmu_ss.add(when: 'CONFIG_DPS310', if_true:
>> > > files('dps310.c'))
>> > >  softmmu_ss.add(when: 'CONFIG_EMC141X', if_true: files('emc141x.c'))
>> > >  softmmu_ss.add(when: 'CONFIG_ADM1272', if_true: files('adm1272.c'))
>> > >  softmmu_ss.add(when: 'CONFIG_MAX34451', if_true: files('max34451.c'))
>> > > +softmmu_ss.add(when: 'CONFIG_AMDSBTSI', if_true:
>> files('tmp_sbtsi.c'))
>> > > diff --git a/hw/sensor/tmp_sbtsi.c b/hw/sensor/tmp_sbtsi.c
>> > > new file mode 100644
>> > > index 00..b68c7ebf61
>> > > --- /dev/null
>> > > +++ b/hw/sensor/tmp_sbtsi.c
>> > > @@ -0,0 +1,393 @@
>> > > +/*
>> > > + * AMD SBI Temperature Sensor Interface (SB-TSI)
>> > > + *
>> > > + * Copyright 2021 Google LLC
>> > > + *
>> > > + * This program is free software; you can redistribute it and/or
>> modify it
>> > > + * under the terms of the GNU General Public License as published by
>> the
>> > > + * Free Software Foundation; either version 2 of the License, or
>> > > + * (at your option) any later version.
>> > > + *
>> > > + * This program is distributed in the hope that it will be useful,
>> but
>> > > WITHOUT
>> > > + * ANY WARRANTY; without even the implied warranty of
>> MERCHANTABILITY or
>> > > + * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public
>> License
>> > > + * for more details.
>> > > + */
>> > > +
>> > > +#include "qemu/osdep.h"
>> > > +#include "hw/i2c/smbus_slave.h"
>> > > +#include "hw/irq.h"
>> > > +#include "migration/vmstate.h"
>> > > +#include "qapi/error.h"
>> > > +#include "qapi/visitor.h"
>> > > +#include "qemu/log.h"
>> > > +#include "qemu/module.h"
>> > > +#include "trace.h"
>> > > +
>> > > +#define TYPE_SBTSI "sbtsi"
>> > > +#define SBTSI(obj) OBJECT_CHECK(SBTSIState, (obj), TYPE_SBTSI)
>> > > +
>> > > +/**
>> > > + * SBTSIState:
>> > > + * temperatures are in units of 0.125 degrees
>> > > + * @temperature: Temperature
>> > > + * @limit_low: Lowest temperature
>> > > + * @limit_high: Highest temperature
>> > > + * @status: The status register
>> > > + * @config: The config register
>> > > + * @alert_config: The config for alarm_l output.
>> > > + * @addr: The address to read/write for the next cmd.
>> > > + * @alarm: The alarm_l output pin (GPIO)
>> > > + */
>> > > +typedef struct SBTSIState {
>> > > +SMBusDevice parent;
>> > > +
>> > > +uint32_t temperature;
>> > > +uint32_t limit_low;
>> > > +uint32_t limit_high;
>> > > +uint8_t status;
>> > > +uint8_t config;
>> > > +uint8_t alert_config;
>> > > +uint8_t addr;
>> > > +qemu_irq alarm;
>> > > +}

Re: [PATCH v3 0/2] tpm: CRB: Use ram_device for "tpm-crb-cmd" region

2022-01-26 Thread Alex Williamson

On Thu, 20 Jan 2022 01:12:40 +0100
Philippe Mathieu-Daudé  wrote:

> This is a respin of Eric's work, but not making tpm_crb.c target
> specific.
> 
> Based-on: <2022012836.229419-1-f4...@amsat.org>
> "exec/cpu: Make host pages variables / macros 'target agnostic'"
> https://lore.kernel.org/qemu-devel/2022012836.229419-1-f4...@amsat.org/
> 
> --
> 
> Eric's v2 cover:
> 
> This series aims at removing a spurious error message we get when
> launching a guest with a TPM-CRB device and VFIO-PCI devices.
> 
> The CRB command buffer currently is a RAM MemoryRegion and given
> its base address alignment, it causes an error report on
> vfio_listener_region_add(). This series proposes to use a ram-device
> region instead which helps in better assessing the dma map error
> failure on VFIO side.
> 
> Eric Auger (2):
>   tpm: CRB: Use ram_device for "tpm-crb-cmd" region
>   hw/vfio/common: Silence ram device offset alignment error traces
> 
>  hw/tpm/tpm_crb.c | 22 --
>  hw/vfio/common.c | 15 ++-
>  hw/vfio/trace-events |  1 +
>  3 files changed, 35 insertions(+), 3 deletions(-)
> 

Unfortunately, FTB:

../hw/tpm/tpm_crb.c: In function ‘tpm_crb_realize’:
../hw/tpm/tpm_crb.c:297:33: warning: implicit declaration of function 
‘HOST_PAGE_ALIGN’ [-Wimplicit-function-declaration]
  297 | HOST_PAGE_ALIGN(CRB_CTRL_CMD_SIZE));
  | ^~~
../hw/tpm/tpm_crb.c:297:33: warning: nested extern declaration of 
‘HOST_PAGE_ALIGN’ [-Wnested-externs]

This is a regression since Eric's v2.  Thanks,

Alex

Re: [PATCH 0/2] Adds designware i2c module and adds it to virt arm

2022-01-26 Thread Philippe Mathieu-Daudé via


+Enrico Weigelt

On 26/1/22 19:03, Peter Maydell wrote:

On Wed, 26 Jan 2022 at 17:12, Chris Rauer  wrote:



I need to see a pretty strong justification for why we should be
adding new kinds of devices to the virt machine,


The designware i2c controller is a very common controller on many
  ARM SoCs.  It has device tree bindings and ACPI bindings which
makes it ideal for this platform.


No, I mean, why do we need an i2c controller on the virt
board at all ?


Forgot to mention, but my prefered approach for providing
an i2c controller on the virt board would be to have a
PCI i2c controller: that way users who do need it can plug it
in with a -device command line option, and users who don't
need it never have to worry about it.



(We seem to have an ICH9-SMB PCI device already; I have no idea if it's 
suitable.)

I didn't find that device suitable because it is part of the Intel
Southbridge, which may have some Intel platform quirks, and
we don't need all the things in that IO hub.


That's a pity. Is there a different PCI I2C controller we could model ?


What about using virtio-gpio & bitbang I2C?

- virtio-gpio
https://lore.kernel.org/qemu-devel/20201127182917.2387-5-i...@metux.net/

- bitbang I2C already in: hw/i2c/bitbang_i2c.c

Regards,

Phil.

Re: [Virtio-fs] [PATCH v4 1/2] virtiofsd: Track mounts

2022-01-26 Thread Vivek Goyal

On Wed, Jan 26, 2022 at 05:47:09PM -0500, Vivek Goyal wrote:
> On Tue, Jan 25, 2022 at 03:12:11PM +0100, Greg Kurz wrote:
> > The upcoming implementation of ->sync_fs() needs to know about all
> > submounts in order to call syncfs() on them when virtiofsd is started
> > without '-o announce_submounts'.
> > 
> > Track every inode that comes up with a new mount id in a GHashTable.
> > If the mount id isn't available, e.g. no statx() on the host, fallback
> > on the device id for the key. This is done during lookup because we
> > only care for the submounts that the client knows about. The inode
> > is removed from the hash table when ultimately unreferenced. This
> > can happen on a per-mount basis when the client posts a FUSE_FORGET
> > request or for all submounts at once with FUSE_DESTROY.
> > 
> > Signed-off-by: Greg Kurz 
> > ---
> >  tools/virtiofsd/passthrough_ll.c | 43 +---
> >  1 file changed, 40 insertions(+), 3 deletions(-)
> > 
> > diff --git a/tools/virtiofsd/passthrough_ll.c 
> > b/tools/virtiofsd/passthrough_ll.c
> > index 64b5b4fbb186..7bf31fc129c8 100644
> > --- a/tools/virtiofsd/passthrough_ll.c
> > +++ b/tools/virtiofsd/passthrough_ll.c
> > @@ -117,6 +117,7 @@ struct lo_inode {
> >  GHashTable *posix_locks; /* protected by lo_inode->plock_mutex */
> >  
> >  mode_t filetype;
> > +bool is_mnt;
> >  };
> >  
> >  struct lo_cred {
> > @@ -164,6 +165,7 @@ struct lo_data {
> >  bool use_statx;
> >  struct lo_inode root;
> >  GHashTable *inodes; /* protected by lo->mutex */
> > +GHashTable *mnt_inodes; /* protected by lo->mutex */
> >  struct lo_map ino_map; /* protected by lo->mutex */
> >  struct lo_map dirp_map; /* protected by lo->mutex */
> >  struct lo_map fd_map; /* protected by lo->mutex */
> > @@ -1000,6 +1002,31 @@ static int do_statx(struct lo_data *lo, int dirfd, 
> > const char *pathname,
> >  return 0;
> >  }
> >  
> 
> Hi Greg,
> 
> Thanks for the patches. Had a quick look. Overall these patches look
> pretty good to me. I will spend more time testing and having a 
> closer look. Some quick thoughts below.
> 
> > +static uint64_t mnt_inode_key(struct lo_inode *inode)
> > +{
> > +/* Prefer mnt_id, fallback on dev */
> > +return inode->key.mnt_id ? inode->key.mnt_id : inode->key.dev;
> > +}
> 
> I am not sure if we should use inode->key.dev. This might create problem
> if same file system is bind mounted at two paths in shared dir. So
> say /dev/sdb is mounted at foo1/ and then bind mounted at foo2/ in
> shared dir. A user looks up foo1/ and does some writes. Then we
> lookup foo2/ and release that inode. Release of foo2 will let go
> inode from the hash. And that means if later another write happens
> in foo1/ followed by syncfs(), we will not issue syncfs() on filesystem
> backed by /dev/sdb.
> 
> So what are the options.
> 
> A. Make mnt_id mandatory and do not implement it if mnt_id is not
>available.
> 
> B. Don't do anything and live with this. It is a corner case and
>still better than not implement submount syncfs at all.
> 
> C. Instead of adding lo_inode to hash, create another kind of object
>and reference count that. It could be a mount fd which we open
>when we add object for the first time. So when foo1/ inode is
>instantiated, create mountfd object, add it to hash table using
>device id as the key. When foo2 comes along, we find the object
>in the hash and just bump up the ref. Now this mountfd object
>will go away when both foo1 and foo2 inodes have been evicted
>and will take care of the issue I am referring to.

And we could take a ref on mountfd object only when we find an
inode whose parent's device id/mnt_id is different from us. That
way for every inode in the system we don't go through this exercise.
Just only those dir inodes which are a mount point.

Vivek

> 
> I guess B is little extra complexity but probably not too bad.
> WDYT. It sounds litter better than option A and B.
> 
> 
> > +
> > +static void add_mnt_inode(struct lo_data *lo, struct lo_inode *inode)
> > +{
> > +uint64_t mnt_key = mnt_inode_key(inode);
> > +
> > +if (!g_hash_table_contains(lo->mnt_inodes, _key)) {
> > +inode->is_mnt = true;
> > +g_hash_table_insert(lo->mnt_inodes, _key, inode);
> > +}
> > +}
> > +
> > +static void remove_mnt_inode(struct lo_data *lo, struct lo_inode *inode)
> > +{
> > +uint64_t mnt_key = mnt_inode_key(inode);
> > +
> > +if (inode->is_mnt) {
> > +g_hash_table_remove(lo->mnt_inodes, _key);
> > +}
> > +}
> 
> Should we issue syncfs() on this inode when we are removing it? It
> is possible guest did some writes, let go inode and later issued
> a syncfs(). By that time inode is gone and we will not issue any
> syncfs() on this filesystem. Hence leaving data in host page cache.
> 
> Thanks
> Vivek
> 
> > +
> >  /*
> >   * Increments nlookup on the inode on success. unref_inode_lolocked() must 
> > be
> >   *

Re: [PATCH v4 1/2] virtiofsd: Track mounts

2022-01-26 Thread Vivek Goyal

On Tue, Jan 25, 2022 at 03:12:11PM +0100, Greg Kurz wrote:
> The upcoming implementation of ->sync_fs() needs to know about all
> submounts in order to call syncfs() on them when virtiofsd is started
> without '-o announce_submounts'.
> 
> Track every inode that comes up with a new mount id in a GHashTable.
> If the mount id isn't available, e.g. no statx() on the host, fallback
> on the device id for the key. This is done during lookup because we
> only care for the submounts that the client knows about. The inode
> is removed from the hash table when ultimately unreferenced. This
> can happen on a per-mount basis when the client posts a FUSE_FORGET
> request or for all submounts at once with FUSE_DESTROY.
> 
> Signed-off-by: Greg Kurz 
> ---
>  tools/virtiofsd/passthrough_ll.c | 43 +---
>  1 file changed, 40 insertions(+), 3 deletions(-)
> 
> diff --git a/tools/virtiofsd/passthrough_ll.c 
> b/tools/virtiofsd/passthrough_ll.c
> index 64b5b4fbb186..7bf31fc129c8 100644
> --- a/tools/virtiofsd/passthrough_ll.c
> +++ b/tools/virtiofsd/passthrough_ll.c
> @@ -117,6 +117,7 @@ struct lo_inode {
>  GHashTable *posix_locks; /* protected by lo_inode->plock_mutex */
>  
>  mode_t filetype;
> +bool is_mnt;
>  };
>  
>  struct lo_cred {
> @@ -164,6 +165,7 @@ struct lo_data {
>  bool use_statx;
>  struct lo_inode root;
>  GHashTable *inodes; /* protected by lo->mutex */
> +GHashTable *mnt_inodes; /* protected by lo->mutex */
>  struct lo_map ino_map; /* protected by lo->mutex */
>  struct lo_map dirp_map; /* protected by lo->mutex */
>  struct lo_map fd_map; /* protected by lo->mutex */
> @@ -1000,6 +1002,31 @@ static int do_statx(struct lo_data *lo, int dirfd, 
> const char *pathname,
>  return 0;
>  }
>  

Hi Greg,

Thanks for the patches. Had a quick look. Overall these patches look
pretty good to me. I will spend more time testing and having a 
closer look. Some quick thoughts below.

> +static uint64_t mnt_inode_key(struct lo_inode *inode)
> +{
> +/* Prefer mnt_id, fallback on dev */
> +return inode->key.mnt_id ? inode->key.mnt_id : inode->key.dev;
> +}

I am not sure if we should use inode->key.dev. This might create problem
if same file system is bind mounted at two paths in shared dir. So
say /dev/sdb is mounted at foo1/ and then bind mounted at foo2/ in
shared dir. A user looks up foo1/ and does some writes. Then we
lookup foo2/ and release that inode. Release of foo2 will let go
inode from the hash. And that means if later another write happens
in foo1/ followed by syncfs(), we will not issue syncfs() on filesystem
backed by /dev/sdb.

So what are the options.

A. Make mnt_id mandatory and do not implement it if mnt_id is not
   available.

B. Don't do anything and live with this. It is a corner case and
   still better than not implement submount syncfs at all.

C. Instead of adding lo_inode to hash, create another kind of object
   and reference count that. It could be a mount fd which we open
   when we add object for the first time. So when foo1/ inode is
   instantiated, create mountfd object, add it to hash table using
   device id as the key. When foo2 comes along, we find the object
   in the hash and just bump up the ref. Now this mountfd object
   will go away when both foo1 and foo2 inodes have been evicted
   and will take care of the issue I am referring to.

I guess B is little extra complexity but probably not too bad.
WDYT. It sounds litter better than option A and B.

> +
> +static void add_mnt_inode(struct lo_data *lo, struct lo_inode *inode)
> +{
> +uint64_t mnt_key = mnt_inode_key(inode);
> +
> +if (!g_hash_table_contains(lo->mnt_inodes, _key)) {
> +inode->is_mnt = true;
> +g_hash_table_insert(lo->mnt_inodes, _key, inode);
> +}
> +}
> +
> +static void remove_mnt_inode(struct lo_data *lo, struct lo_inode *inode)
> +{
> +uint64_t mnt_key = mnt_inode_key(inode);
> +
> +if (inode->is_mnt) {
> +g_hash_table_remove(lo->mnt_inodes, _key);
> +}
> +}

Should we issue syncfs() on this inode when we are removing it? It
is possible guest did some writes, let go inode and later issued
a syncfs(). By that time inode is gone and we will not issue any
syncfs() on this filesystem. Hence leaving data in host page cache.

Thanks
Vivek

> +
>  /*
>   * Increments nlookup on the inode on success. unref_inode_lolocked() must be
>   * called eventually to decrement nlookup again. If inodep is non-NULL, the
> @@ -1086,10 +1113,15 @@ static int lo_do_lookup(fuse_req_t req, fuse_ino_t 
> parent, const char *name,
>  pthread_mutex_lock(>mutex);
>  inode->fuse_ino = lo_add_inode_mapping(req, inode);
>  g_hash_table_insert(lo->inodes, >key, inode);
> +add_mnt_inode(lo, inode);
>  pthread_mutex_unlock(>mutex);
>  }
>  e->ino = inode->fuse_ino;
>  
> +fuse_log(FUSE_LOG_DEBUG, "  %lli/%s -> %lli%s\n",
> + (unsigned long long)

Re: [PATCH v2 07/14] target/ppc: 405: External exception cleanup

2022-01-26 Thread Fabiano Rosas

Richard Henderson  writes:

> On 1/19/22 5:44 AM, Fabiano Rosas wrote:
>> 405 has no MSR_HV and EPR is BookE only so we can remove it all.
>> 
>> Signed-off-by: Fabiano Rosas 
>> Reviewed-by: David Gibson 
>> ---
>>   target/ppc/excp_helper.c | 37 -
>>   1 file changed, 37 deletions(-)
>> 
>> diff --git a/target/ppc/excp_helper.c b/target/ppc/excp_helper.c
>> index e98d783ecd..8fae8aa0be 100644
>> --- a/target/ppc/excp_helper.c
>> +++ b/target/ppc/excp_helper.c
>> @@ -472,44 +472,7 @@ static void powerpc_excp_40x(PowerPCCPU *cpu, int excp)
>>   msr |= env->error_code;
>>   break;
>>   case POWERPC_EXCP_EXTERNAL:  /* External input 
>>   */
>> -{
>> -bool lpes0;
>> -
>> -cs = CPU(cpu);
>> -
>> -/*
>> - * Exception targeting modifiers
>> - *
>> - * LPES0 is supported on POWER7/8/9
>> - * LPES1 is not supported (old iSeries mode)
>> - *
>> - * On anything else, we behave as if LPES0 is 1
>> - * (externals don't alter MSR:HV)
>> - */
>> -#if defined(TARGET_PPC64)
>> -if (excp_model == POWERPC_EXCP_POWER7 ||
>> -excp_model == POWERPC_EXCP_POWER8 ||
>> -excp_model == POWERPC_EXCP_POWER9 ||
>> -excp_model == POWERPC_EXCP_POWER10) {
>> -lpes0 = !!(env->spr[SPR_LPCR] & LPCR_LPES0);
>> -} else
>> -#endif /* defined(TARGET_PPC64) */
>> -{
>> -lpes0 = true;
>> -}
>> -
>> -if (!lpes0) {
>> -new_msr |= (target_ulong)MSR_HVB;
>> -new_msr |= env->msr & ((target_ulong)1 << MSR_RI);
>> -srr0 = SPR_HSRR0;
>> -srr1 = SPR_HSRR1;
>> -}
>> -if (env->mpic_proxy) {
>> -/* IACK the IRQ on delivery */
>> -env->spr[SPR_BOOKE_EPR] = ldl_phys(cs->as, env->mpic_iack);
>> -}
>>   break;
>> -}
>
> Bare break?  Should this be reachable at all?
> Should it in fact be g_assert_not_reached()?

It should be reachable. It is a valid exception for this CPU. We just
don't have anything else to do to dispatch it aside from what is done in
the generic code outside the switch statement.

Re: [PATCH v2 08/14] target/ppc: 405: System call exception cleanup

2022-01-26 Thread Richard Henderson


On 1/19/22 5:44 AM, Fabiano Rosas wrote:

There's no sc 1.

Signed-off-by: Fabiano Rosas
---
  target/ppc/excp_helper.c | 21 ++---
  1 file changed, 2 insertions(+), 19 deletions(-)


Reviewed-by: Richard Henderson 

r~

Re: [PATCH 0/2] Adds designware i2c module and adds it to virt arm

2022-01-26 Thread Chris Rauer

>>> I need to see a pretty strong justification for why we should be

>>> adding new kinds of devices to the virt machine,

>>

>> The designware i2c controller is a very common controller on many

>>  ARM SoCs.  It has device tree bindings and ACPI bindings which

>> makes it ideal for this platform.

>

>No, I mean, why do we need an i2c controller on the virt

>board at all ?

The use case we are interested in is to enable IPMI SSIF on the arm virt
machine which is why I was interested in something with ACPI support.  That
particular IPMI SSIF enablement patch is not in this series but I think it
could be pulled into the series if necessary.

> > >Forgot to mention, but my prefered approach for providing

> > >an i2c controller on the virt board would be to have a

> > >PCI i2c controller: that way users who do need it can plug it

> > >in with a -device command line option, and users who don't

> > >need it never have to worry about it.

>

> > > (We seem to have an ICH9-SMB PCI device already; I have no idea if
it's suitable.)

> > I didn't find that device suitable because it is part of the Intel

> > Southbridge, which may have some Intel platform quirks, and

> > we don't need all the things in that IO hub.

>

> That's a pity. Is there a different PCI I2C controller we could model ?

I’m not aware of any standalone PCI I2C controllers.  I’ve seen I2C
controllers on PCI devices with other things but I don’t think those could
be used for IPMI SSIF or other general I2C use cases.   Do you know of a
particular device I should take a look at?


-Chris


On Wed, Jan 26, 2022 at 10:03 AM Peter Maydell 
wrote:

> On Wed, 26 Jan 2022 at 17:12, Chris Rauer  wrote:
> >
> >> I need to see a pretty strong justification for why we should be
> >> adding new kinds of devices to the virt machine,
> >
> > The designware i2c controller is a very common controller on many
> >  ARM SoCs.  It has device tree bindings and ACPI bindings which
> > makes it ideal for this platform.
>
> No, I mean, why do we need an i2c controller on the virt
> board at all ?
>
> > >Forgot to mention, but my prefered approach for providing
> > >an i2c controller on the virt board would be to have a
> > >PCI i2c controller: that way users who do need it can plug it
> > >in with a -device command line option, and users who don't
> > >need it never have to worry about it.
>
> > > (We seem to have an ICH9-SMB PCI device already; I have no idea if
> it's suitable.)
> > I didn't find that device suitable because it is part of the Intel
> > Southbridge, which may have some Intel platform quirks, and
> > we don't need all the things in that IO hub.
>
> That's a pity. Is there a different PCI I2C controller we could model ?
>
> thanks
> -- PMM
>

Re: [PATCH v2 05/14] target/ppc: 405: Critical exceptions cleanup

2022-01-26 Thread Richard Henderson


On 1/19/22 5:44 AM, Fabiano Rosas wrote:

In powerpc_excp_40x the Critical exception is now for 405 only, so we
can remove the BookE and G2 blocks.

Signed-off-by: Fabiano Rosas
Reviewed-by: David Gibson
---
  target/ppc/excp_helper.c | 17 ++---
  1 file changed, 2 insertions(+), 15 deletions(-)


Reviewed-by: Richard Henderson 

r~

Re: [PATCH v2 07/14] target/ppc: 405: External exception cleanup

2022-01-26 Thread Richard Henderson


On 1/19/22 5:44 AM, Fabiano Rosas wrote:

405 has no MSR_HV and EPR is BookE only so we can remove it all.

Signed-off-by: Fabiano Rosas 
Reviewed-by: David Gibson 
---
  target/ppc/excp_helper.c | 37 -
  1 file changed, 37 deletions(-)

diff --git a/target/ppc/excp_helper.c b/target/ppc/excp_helper.c
index e98d783ecd..8fae8aa0be 100644
--- a/target/ppc/excp_helper.c
+++ b/target/ppc/excp_helper.c
@@ -472,44 +472,7 @@ static void powerpc_excp_40x(PowerPCCPU *cpu, int excp)
  msr |= env->error_code;
  break;
  case POWERPC_EXCP_EXTERNAL:  /* External input   
*/
-{
-bool lpes0;
-
-cs = CPU(cpu);
-
-/*
- * Exception targeting modifiers
- *
- * LPES0 is supported on POWER7/8/9
- * LPES1 is not supported (old iSeries mode)
- *
- * On anything else, we behave as if LPES0 is 1
- * (externals don't alter MSR:HV)
- */
-#if defined(TARGET_PPC64)
-if (excp_model == POWERPC_EXCP_POWER7 ||
-excp_model == POWERPC_EXCP_POWER8 ||
-excp_model == POWERPC_EXCP_POWER9 ||
-excp_model == POWERPC_EXCP_POWER10) {
-lpes0 = !!(env->spr[SPR_LPCR] & LPCR_LPES0);
-} else
-#endif /* defined(TARGET_PPC64) */
-{
-lpes0 = true;
-}
-
-if (!lpes0) {
-new_msr |= (target_ulong)MSR_HVB;
-new_msr |= env->msr & ((target_ulong)1 << MSR_RI);
-srr0 = SPR_HSRR0;
-srr1 = SPR_HSRR1;
-}
-if (env->mpic_proxy) {
-/* IACK the IRQ on delivery */
-env->spr[SPR_BOOKE_EPR] = ldl_phys(cs->as, env->mpic_iack);
-}
  break;
-}


Bare break?  Should this be reachable at all?
Should it in fact be g_assert_not_reached()?


r~

Re: [PATCH v2 03/14] target/ppc: Introduce powerpc_excp_40x

2022-01-26 Thread Richard Henderson


On 1/19/22 5:44 AM, Fabiano Rosas wrote:

Introduce a new powerpc_excp function specific for 40x CPUs. This
commit copies powerpc_excp_legacy verbatim so the next one has a clean
diff.

Signed-off-by: Fabiano Rosas
---
  target/ppc/excp_helper.c | 474 +++
  1 file changed, 474 insertions(+)


Reviewed-by: Richard Henderson 

r~

Re: [PATCH 0/8] target/ppc: powerpc_excp improvements [74xx] (5/n)

2022-01-26 Thread Fabiano Rosas

BALATON Zoltan  writes:

> On Wed, 26 Jan 2022, Fabiano Rosas wrote:
>> This handles the exception code for the 74xx family, i.e. 7400, 7410,
>> 7440, 7445, 7450, 7455, 7457.
>>
>> This is another family that is quite well known, so it should be
>> straight-forward as well.
>
> I guess this is what may break VOF on pegasos2. Was Philippe's test case 
> for this machine ever merged? (Although that may use the firmware ROM that 
> was preferred as it tests more of the machine and may predate VOF so not 
> sure it also tests with VOF.) The way to test it is this:
>
> Get morphos demo ISO from https://www.morphos-team.net/morphos-3.15.iso
> Extract boot.img from the root directory of the CD
> Run QEMU as shown at http://zero.eik.bme.hu/~balaton/qemu/amiga/#morphos
>
> (For debugging maybe enabling vof traces would give more info but it was 
> a while so I don't remember the details any more.)
>
>> Based on legoater/ppc-7.0
>
> I could test when it's merged or when it applies on master but I don't 
> usually test on branches. Did you verify it still works with pegasos2 or 
> could you please make sure it won't break that use case?

I can test with pegasos2. I'll make sure we don't merge this before
testing there.

That said, I forgot about the sc 1 thing and it is indeed broken. I'll
send a v2 fixing it.

RE: [RFC v4 12/21] vfio-user: region read/write

2022-01-26 Thread Thanos Makatos



> -Original Message-
> From: Qemu-devel  bounces+thanos.makatos=nutanix@nongnu.org> On Behalf Of John
> Johnson
> Sent: 12 January 2022 00:44
> To: qemu-devel@nongnu.org
> Subject: [RFC v4 12/21] vfio-user: region read/write
> 
> Add support for posted writes on remote devices
> 
> Signed-off-by: Elena Ufimtseva 
> Signed-off-by: John G Johnson 
> Signed-off-by: Jagannathan Raman 
> ---
>  hw/vfio/pci.h |   1 +
>  hw/vfio/user-protocol.h   |  12 +
>  hw/vfio/user.h|   1 +
>  include/hw/vfio/vfio-common.h |   7 +--
>  hw/vfio/common.c  |  10 +++-
>  hw/vfio/pci.c |   9 +++-
>  hw/vfio/user.c| 109
> ++
>  7 files changed, 143 insertions(+), 6 deletions(-)
> 
> diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
> index ec9f345..643ff75 100644
> --- a/hw/vfio/pci.h
> +++ b/hw/vfio/pci.h
> @@ -194,6 +194,7 @@ struct VFIOUserPCIDevice {
>  VFIOPCIDevice device;
>  char *sock_name;
>  bool send_queued;   /* all sends are queued */
> +bool no_post;   /* all regions write are sync */
>  };
> 
>  /* Use uin32_t for vendor & device so PCI_ANY_ID expands and cannot match
> hw */
> diff --git a/hw/vfio/user-protocol.h b/hw/vfio/user-protocol.h
> index caa523a..b1ea55f 100644
> --- a/hw/vfio/user-protocol.h
> +++ b/hw/vfio/user-protocol.h
> @@ -120,4 +120,16 @@ typedef struct {
>  uint64_t offset;
>  } VFIOUserRegionInfo;
> 
> +/*
> + * VFIO_USER_REGION_READ
> + * VFIO_USER_REGION_WRITE
> + */
> +typedef struct {
> +VFIOUserHdr hdr;
> +uint64_t offset;
> +uint32_t region;
> +uint32_t count;
> +char data[];
> +} VFIOUserRegionRW;
> +
>  #endif /* VFIO_USER_PROTOCOL_H */
> diff --git a/hw/vfio/user.h b/hw/vfio/user.h
> index 19edd84..f2098f2 100644
> --- a/hw/vfio/user.h
> +++ b/hw/vfio/user.h
> @@ -75,6 +75,7 @@ typedef struct VFIOProxy {
>  /* VFIOProxy flags */
>  #define VFIO_PROXY_CLIENT0x1
>  #define VFIO_PROXY_FORCE_QUEUED  0x4
> +#define VFIO_PROXY_NO_POST   0x8
> 
>  VFIOProxy *vfio_user_connect_dev(SocketAddress *addr, Error **errp);
>  void vfio_user_disconnect(VFIOProxy *proxy);
> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> index 2552557..4118b8a 100644
> --- a/include/hw/vfio/vfio-common.h
> +++ b/include/hw/vfio/vfio-common.h
> @@ -57,6 +57,7 @@ typedef struct VFIORegion {
>  VFIOMmap *mmaps;
>  uint8_t nr; /* cache the region number for debug */
>  int fd; /* fd to mmap() region */
> +bool post_wr; /* writes can be posted */
>  } VFIORegion;
> 
>  typedef struct VFIOMigration {
> @@ -180,7 +181,7 @@ struct VFIODevIO {
>  int (*region_read)(VFIODevice *vdev, uint8_t nr, off_t off, uint32_t 
> size,
> void *data);
>  int (*region_write)(VFIODevice *vdev, uint8_t nr, off_t off, uint32_t 
> size,
> -void *data);
> +void *data, bool post);
>  };
> 
>  #define VDEV_GET_INFO(vdev, info) \
> @@ -193,8 +194,8 @@ struct VFIODevIO {
>  ((vdev)->io_ops->set_irqs((vdev), (irqs)))
>  #define VDEV_REGION_READ(vdev, nr, off, size, data) \
>  ((vdev)->io_ops->region_read((vdev), (nr), (off), (size), (data)))
> -#define VDEV_REGION_WRITE(vdev, nr, off, size, data) \
> -((vdev)->io_ops->region_write((vdev), (nr), (off), (size), (data)))
> +#define VDEV_REGION_WRITE(vdev, nr, off, size, data, post) \
> +((vdev)->io_ops->region_write((vdev), (nr), (off), (size), (data), 
> (post)))
> 
>  struct VFIOContIO {
>  int (*dma_map)(VFIOContainer *container,
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index a50bf4b..83cc5ec 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -213,6 +213,7 @@ void vfio_region_write(void *opaque, hwaddr addr,
>  uint32_t dword;
>  uint64_t qword;
>  } buf;
> +bool post = region->post_wr;
>  int ret;
> 
>  switch (size) {
> @@ -233,7 +234,11 @@ void vfio_region_write(void *opaque, hwaddr addr,
>  break;
>  }
> 
> -ret = VDEV_REGION_WRITE(vbasedev, region->nr, addr, size, );
> +/* read-after-write hazard if guest can directly access region */
> +if (region->nr_mmaps) {
> +post = false;
> +}
> +ret = VDEV_REGION_WRITE(vbasedev, region->nr, addr, size, , post);
>  if (ret != size) {
>  const char *err = ret < 0 ? strerror(-ret) : "short write";
> 
> @@ -1555,6 +1560,7 @@ int vfio_region_setup(Object *obj, VFIODevice
> *vbasedev, VFIORegion *region,
>  region->size = info->size;
>  region->fd_offset = info->offset;
>  region->nr = index;
> +region->post_wr = false;
>  if (vbasedev->regfds != NULL) {
>  region->fd = vbasedev->regfds[index];
>  } else {
> @@ -2689,7 +2695,7 @@ static int vfio_io_region_read(VFIODevice *vbasedev,
> uint8_t index, off_t off,
>  }
> 
>  static int vfio_io_region_write(VFIODevice *vbasedev, uint8_t index, off_t 
> off,

Re: [RFC PATCH] linux-user: expand reserved brk space for 64bit guests

2022-01-26 Thread Richard Henderson


On 1/14/22 3:55 AM, Alex Bennée wrote:

A recent change to fix commpage allocation issues on 32bit hosts
revealed another intermittent issue on s390x. The root cause was the
headroom we give for the brk space wasn't enough causing the guest to
attempt to map something on top of QEMUs own pages. We do not
currently do anything to protect from this (see #555).

By inspection the brk mmap moves around and top of the address range
has been measured as far as 19Mb away from the top of the binary. As
we chose a smallish number to keep 32bit on 32 bit feasible we only
increase the gap for 64 bit guests. This does mean that 64-on-32
static binaries are more likely to fail to find a hole in the address
space but that is hopefully a fairly rare situation.

Signed-off-by: Alex Bennée 
---
  linux-user/elfload.c | 12 +---
  1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/linux-user/elfload.c b/linux-user/elfload.c
index 64b87d37e8..9628a38361 100644
--- a/linux-user/elfload.c
+++ b/linux-user/elfload.c
@@ -2800,11 +2800,17 @@ static void load_elf_image(const char *image_name, int 
image_fd,
   * and the stack, lest they be placed immediately after
   * the data segment and block allocation from the brk.
   *
- * 16MB is chosen as "large enough" without being so large
- * as to allow the result to not fit with a 32-bit guest on
- * a 32-bit host.
+ * 16MB is chosen as "large enough" without being so large as
+ * to allow the result to not fit with a 32-bit guest on a
+ * 32-bit host. However some 64 bit guests (e.g. s390x)
+ * attempt to place their heap further ahead and currently
+ * nothing stops them smashing into QEMUs address space.
   */
+#if TARGET_LONG_BITS == 64
+info->reserve_brk = 32 * MiB;
+#else
  info->reserve_brk = 16 * MiB;
+#endif


I'd prefer to use 32M unconditionally, but...
Reviewed-by: Richard Henderson 

r~

Re: [PATCH 2/2] qemu-timer: Skip empty timer lists before locking in qemu_clock_deadline_ns_all

2022-01-26 Thread Richard Henderson


On 1/14/22 11:43 AM, Idan Horowitz wrote:

This decreases qemu_clock_deadline_ns_all's share from 23.2% to 13% in a
profile of icount-enabled aarch64-softmmu.

Signed-off-by: Idan Horowitz
---
  util/qemu-timer.c | 3 +++
  1 file changed, 3 insertions(+)


Reviewed-by: Richard Henderson 

r~

Re: [PATCH 1/2] softmmu/cpus: Check if the cpu work list is empty atomically

2022-01-26 Thread Richard Henderson


On 1/14/22 11:43 AM, Idan Horowitz wrote:

Instead of taking the lock of the cpu work list in order to check if it's
empty, we can just read the head pointer atomically. This decreases
cpu_work_list_empty's share from 5% to 1.3% in a profile of icount-enabled
aarch64-softmmu.

Signed-off-by: Idan Horowitz
---
  softmmu/cpus.c | 7 +--
  1 file changed, 1 insertion(+), 6 deletions(-)


Reviewed-by: Richard Henderson 

Queued to tcg-next.


r~

Re: [PATCH v1 21/22] target/i386: use CPU_LOG_INT for IRQ servicing

2022-01-26 Thread Richard Henderson


On 1/25/22 7:16 AM, Alex Bennée wrote:

I think these have been wrong since f193c7979c (do not depend on
thunk.h - more log items). Fix them so as not to confuse other
debugging.

Signed-off-by: Alex Bennée
---
  target/i386/tcg/sysemu/seg_helper.c | 4 ++--
  1 file changed, 2 insertions(+), 2 deletions(-)


Reviewed-by: Richard Henderson 

r~

[PATCH v3 0/2] linux-user: check read permissions before how

2022-01-26 Thread Patrick Venture

In sigprocmask check the read permissions first before checking the `how`.

This is done for both: TARGET_NR_sigprocmask and TARGET_NR_rt_sigprocmask

v3:
 * Added missing signature from venture
v2:
 * Update code style during code change
 * Also update check order for TARGET_NR_sigprocmask

Patrick Venture (1):
  linux-user: sigprocmask check read perms first

Shu-Chun Weng (1):
  linux-user: rt_sigprocmask, check read perms first

 linux-user/syscall.c | 24 ++--
 1 file changed, 14 insertions(+), 10 deletions(-)

-- 
2.35.0.rc0.227.g00780c9af4-goog

[PATCH v3 1/2] linux-user: rt_sigprocmask, check read perms first

2022-01-26 Thread Patrick Venture

From: Shu-Chun Weng 

Linux kernel does it this way (checks read permission before validating `how`)
and the latest version of ABSL's `AddressIsReadable()` depends on this
behavior.

c.f.  
https://github.com/torvalds/linux/blob/9539ba4308ad5bdca6cb41c7b73cbb9f796dcdd7/kernel/signal.c#L3147
Reviewed-by: Patrick Venture 
Signed-off-by: Shu-Chun Weng 
Reviewed-by: Laurent Vivier 
Signed-off-by: Patrick Venture 
---
 linux-user/syscall.c | 12 +++-
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/linux-user/syscall.c b/linux-user/syscall.c
index 5950222a77..34bd819e38 100644
--- a/linux-user/syscall.c
+++ b/linux-user/syscall.c
@@ -9508,6 +9508,13 @@ static abi_long do_syscall1(void *cpu_env, int num, 
abi_long arg1,
 }
 
 if (arg2) {
+p = lock_user(VERIFY_READ, arg2, sizeof(target_sigset_t), 1);
+if (!p) {
+return -TARGET_EFAULT;
+}
+target_to_host_sigset(, p);
+unlock_user(p, arg2, 0);
+set_ptr = 
 switch(how) {
 case TARGET_SIG_BLOCK:
 how = SIG_BLOCK;
@@ -9521,11 +9528,6 @@ static abi_long do_syscall1(void *cpu_env, int num, 
abi_long arg1,
 default:
 return -TARGET_EINVAL;
 }
-if (!(p = lock_user(VERIFY_READ, arg2, 
sizeof(target_sigset_t), 1)))
-return -TARGET_EFAULT;
-target_to_host_sigset(, p);
-unlock_user(p, arg2, 0);
-set_ptr = 
 } else {
 how = 0;
 set_ptr = NULL;
-- 
2.35.0.rc0.227.g00780c9af4-goog

[PATCH v3 2/2] linux-user: sigprocmask check read perms first

2022-01-26 Thread Patrick Venture

Linux kernel now checks the read permissions before validating `how`

Suggested-by: Laurent Vivier 
Signed-off-by: Patrick Venture 
Reviewed-by: Laurent Vivier 
---
 linux-user/syscall.c | 12 +++-
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/linux-user/syscall.c b/linux-user/syscall.c
index 34bd819e38..210483d4e4 100644
--- a/linux-user/syscall.c
+++ b/linux-user/syscall.c
@@ -9465,6 +9465,13 @@ static abi_long do_syscall1(void *cpu_env, int num, 
abi_long arg1,
 int how;
 
 if (arg2) {
+p = lock_user(VERIFY_READ, arg2, sizeof(target_sigset_t), 1));
+if (!p) {
+return -TARGET_EFAULT;
+}
+target_to_host_old_sigset(, p);
+unlock_user(p, arg2, 0);
+set_ptr = 
 switch (arg1) {
 case TARGET_SIG_BLOCK:
 how = SIG_BLOCK;
@@ -9478,11 +9485,6 @@ static abi_long do_syscall1(void *cpu_env, int num, 
abi_long arg1,
 default:
 return -TARGET_EINVAL;
 }
-if (!(p = lock_user(VERIFY_READ, arg2, 
sizeof(target_sigset_t), 1)))
-return -TARGET_EFAULT;
-target_to_host_old_sigset(, p);
-unlock_user(p, arg2, 0);
-set_ptr = 
 } else {
 how = 0;
 set_ptr = NULL;
-- 
2.35.0.rc0.227.g00780c9af4-goog

Re: [PATCH] migration: Don't return for postcopy_send_discard_bm_ram()

2022-01-26 Thread Juan Quintela

Philippe Mathieu-Daudé  wrote:
> postcopy_send_discard_bm_ram() always return zero. Since it can't
> fail, simplify and do not return anything.
>
> Signed-off-by: Philippe Mathieu-Daudé 
> ---
> Based-on: <20211224065000.97572-1-pet...@redhat.com>

And here, I learn to read Based-on:

Reviewed-by: Juan Quintela 

queued.

Re: [PATCH 5/7] migration: Drop return code for disgard ram process

2022-01-26 Thread Juan Quintela

Peter Xu  wrote:
> It will just never fail.  Drop those return values where they're constantly
> zeros.
>
> A tiny touch-up on the tracepoint so trace_ram_postcopy_send_discard_bitmap()
> is called after the logic itself (which sounds more reasonable).
>
> Signed-off-by: Peter Xu 

Reviewed-by: Juan Quintela 

queued

Re: [PATCH 4/7] migration: Do chunk page in postcopy_each_ram_send_discard()

2022-01-26 Thread Juan Quintela

Peter Xu  wrote:
> Right now we loop ramblocks for twice, the 1st time chunk the dirty bits with
> huge page information; the 2nd time we send the discard ranges.  That's not
> necessary - we can do them in a single loop.
>
> Signed-off-by: Peter Xu 

Reviewed-by: Juan Quintela 

queued

Re: [PATCH 3/7] migration: Drop postcopy_chunk_hostpages()

2022-01-26 Thread Juan Quintela

Peter Xu  wrote:
> This function calls three functions:
>
>   - postcopy_discard_send_init(ms, block->idstr);
>   - postcopy_chunk_hostpages_pass(ms, block);
>   - postcopy_discard_send_finish(ms);
>
> However only the 2nd function call is meaningful.  It's major role is to make
> sure dirty bits are applied in host-page-size granule, so there will be no
> partial dirty bits set for a whole host page if huge pages are used.
>
> The 1st/3rd call are for latter when we want to send the disgard ranges.
> They're mostly no-op here besides some tracepoints (which are misleading!).
>
> Drop them, then we can directly drop postcopy_chunk_hostpages() as a whole
> because we can call postcopy_chunk_hostpages_pass() directly.
>
> There're still some nice comments above postcopy_chunk_hostpages() that 
> explain
> what it does.  Copy it over to the caller's site.
>
> Signed-off-by: Peter Xu 

Reviewed-by: Juan Quintela 

queued

Re: [PATCH v5 03/18] pci: isolated address space for PCI bus

2022-01-26 Thread Michael S. Tsirkin

On Wed, Jan 26, 2022 at 08:07:36PM +, Dr. David Alan Gilbert wrote:
> * Stefan Hajnoczi (stefa...@redhat.com) wrote:
> > On Wed, Jan 26, 2022 at 05:27:32AM +, Jag Raman wrote:
> > > 
> > > 
> > > > On Jan 25, 2022, at 1:38 PM, Dr. David Alan Gilbert 
> > > >  wrote:
> > > > 
> > > > * Jag Raman (jag.ra...@oracle.com) wrote:
> > > >> 
> > > >> 
> > > >>> On Jan 19, 2022, at 7:12 PM, Michael S. Tsirkin  
> > > >>> wrote:
> > > >>> 
> > > >>> On Wed, Jan 19, 2022 at 04:41:52PM -0500, Jagannathan Raman wrote:
> > >  Allow PCI buses to be part of isolated CPU address spaces. This has a
> > >  niche usage.
> > >  
> > >  TYPE_REMOTE_MACHINE allows multiple VMs to house their PCI devices in
> > >  the same machine/server. This would cause address space collision as
> > >  well as be a security vulnerability. Having separate address spaces 
> > >  for
> > >  each PCI bus would solve this problem.
> > > >>> 
> > > >>> Fascinating, but I am not sure I understand. any examples?
> > > >> 
> > > >> Hi Michael!
> > > >> 
> > > >> multiprocess QEMU and vfio-user implement a client-server model to 
> > > >> allow
> > > >> out-of-process emulation of devices. The client QEMU, which makes 
> > > >> ioctls
> > > >> to the kernel and runs VCPUs, could attach devices running in a server
> > > >> QEMU. The server QEMU needs access to parts of the client’s RAM to
> > > >> perform DMA.
> > > > 
> > > > Do you ever have the opposite problem? i.e. when an emulated PCI device
> > > 
> > > That’s an interesting question.
> > > 
> > > > exposes a chunk of RAM-like space (frame buffer, or maybe a mapped file)
> > > > that the client can see.  What happens if two emulated devices need to
> > > > access each others emulated address space?
> > > 
> > > In this case, the kernel driver would map the destination’s chunk of 
> > > internal RAM into
> > > the DMA space of the source device. Then the source device could write to 
> > > that
> > > mapped address range, and the IOMMU should direct those writes to the
> > > destination device.
> > > 
> > > I would like to take a closer look at the IOMMU implementation on how to 
> > > achieve
> > > this, and get back to you. I think the IOMMU would handle this. Could you 
> > > please
> > > point me to the IOMMU implementation you have in mind?
> > 
> > I don't know if the current vfio-user client/server patches already
> > implement device-to-device DMA, but the functionality is supported by
> > the vfio-user protocol.
> > 
> > Basically: if the DMA regions lookup inside the vfio-user server fails,
> > fall back to VFIO_USER_DMA_READ/WRITE messages instead.
> > https://github.com/nutanix/libvfio-user/blob/master/docs/vfio-user.rst#vfio-user-dma-read
> > 
> > Here is the flow:
> > 1. The vfio-user server with device A sends a DMA read to QEMU.
> > 2. QEMU finds the MemoryRegion associated with the DMA address and sees
> >it's a device.
> >a. If it's emulated inside the QEMU process then the normal
> >   device emulation code kicks in.
> >b. If it's another vfio-user PCI device then the vfio-user PCI proxy
> >   device forwards the DMA to the second vfio-user server's device B.
> 
> I'm starting to be curious if there's a way to persuade the guest kernel
> to do it for us; in general is there a way to say to PCI devices that
> they can only DMA to the host and not other PCI devices?


But of course - this is how e.g. VFIO protects host PCI devices from
each other when one of them is passed through to a VM.

>  Or that the
> address space of a given PCIe bus is non-overlapping with one of the
> others?



> Dave
> > Stefan
> 
> 
> -- 
> Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK

Re: [PATCH v2 1/2] linux-user: rt_sigprocmask, check read perms first

2022-01-26 Thread Patrick Venture

On Wed, Jan 26, 2022 at 11:26 AM Laurent Vivier  wrote:

> Le 26/01/2022 à 18:58, Patrick Venture a écrit :
> > From: Shu-Chun Weng 
> >
> > Linux kernel does it this way (checks read permission before validating
> `how`)
> > and the latest version of ABSL's `AddressIsReadable()` depends on this
> > behavior.
> >
> > c.f.
> https://github.com/torvalds/linux/blob/9539ba4308ad5bdca6cb41c7b73cbb9f796dcdd7/kernel/signal.c#L3147
> > Reviewed-by: Patrick Venture 
> > Signed-off-by: Shu-Chun Weng 
>
> Reviewed-by: Laurent Vivier 
>
> but you must resend the patch: you are not the author, but you have to add
> your Signed-off-by.
> (and now you can add my reviewed-by)
>
>
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/SubmittingPatches?id=f6f94e2ab1b33f0082ac22d71f66385a60d8157f#n296


Thanks! I definitely forgot to sign the patches I wasn't the author -- but
you're right and thanks for pointing me to the guide.


>
>
> Thanks,
> Laurent
>
> > ---
> >   linux-user/syscall.c | 12 +++-
> >   1 file changed, 7 insertions(+), 5 deletions(-)
> >
> > diff --git a/linux-user/syscall.c b/linux-user/syscall.c
> > index 5950222a77..34bd819e38 100644
> > --- a/linux-user/syscall.c
> > +++ b/linux-user/syscall.c
> > @@ -9508,6 +9508,13 @@ static abi_long do_syscall1(void *cpu_env, int
> num, abi_long arg1,
> >   }
> >
> >   if (arg2) {
> > +p = lock_user(VERIFY_READ, arg2,
> sizeof(target_sigset_t), 1);
> > +if (!p) {
> > +return -TARGET_EFAULT;
> > +}
> > +target_to_host_sigset(, p);
> > +unlock_user(p, arg2, 0);
> > +set_ptr = 
> >   switch(how) {
> >   case TARGET_SIG_BLOCK:
> >   how = SIG_BLOCK;
> > @@ -9521,11 +9528,6 @@ static abi_long do_syscall1(void *cpu_env, int
> num, abi_long arg1,
> >   default:
> >   return -TARGET_EINVAL;
> >   }
> > -if (!(p = lock_user(VERIFY_READ, arg2,
> sizeof(target_sigset_t), 1)))
> > -return -TARGET_EFAULT;
> > -target_to_host_sigset(, p);
> > -unlock_user(p, arg2, 0);
> > -set_ptr = 
> >   } else {
> >   how = 0;
> >   set_ptr = NULL;
>
>

Re: [PATCH 2/7] migration: Don't return for postcopy_chunk_hostpages()

2022-01-26 Thread Juan Quintela

Peter Xu  wrote:
> It always return zero, because it just can't go wrong so far.  Simplify the
> code with no functional change.
>
> Signed-off-by: Peter Xu 

Reviewed-by: Juan Quintela 

queued

[PATCH] linux-user: Fix inotify on aarch64

2022-01-26 Thread Paul Brook

The inotify implementation originally called the raw host syscalls.
Commit 3b3f24add0 changed this to use the glibc wrappers. However ifdefs
in syscall.c still test for presence of the raw syscalls.

This causes a problem on e.g. aarch64 hosts which never had the
inotify_init syscall - it had been obsoleted by inotify_init1 before
aarch64 was invented! However it does have a perfectly good glibc
implementation of inotify_wait.

Fix this by removing all the raw __NR_inotify_* tests, and instead check
CONFIG_INOTIFY, which already tests for the glibc functionality we use.

Also remove the now-pointless sys_inotify* wrappers.

Tested using x86-64 inotifywatch on aarch64 host, and vice-versa

Signed-off-by: Paul Brook 
---
 linux-user/fd-trans.c |  5 ++---
 linux-user/syscall.c  | 50 +--
 2 files changed, 12 insertions(+), 43 deletions(-)

diff --git a/linux-user/fd-trans.c b/linux-user/fd-trans.c
index 6941089959..30e7b49112 100644
--- a/linux-user/fd-trans.c
+++ b/linux-user/fd-trans.c
@@ -1460,9 +1460,8 @@ TargetFdTrans target_eventfd_trans = {
 .target_to_host_data = swap_data_eventfd,
 };
 
-#if (defined(TARGET_NR_inotify_init) && defined(__NR_inotify_init)) || \
-(defined(CONFIG_INOTIFY1) && defined(TARGET_NR_inotify_init1) && \
- defined(__NR_inotify_init1))
+#if defined(CONFIG_INOTIFY) && (defined(TARGET_NR_inotify_init) || \
+defined(TARGET_NR_inotify_init1))
 static abi_long host_to_target_data_inotify(void *buf, size_t len)
 {
 struct inotify_event *ev;
diff --git a/linux-user/syscall.c b/linux-user/syscall.c
index 56a3e17183..17cc38fe34 100644
--- a/linux-user/syscall.c
+++ b/linux-user/syscall.c
@@ -272,9 +272,6 @@ static type name (type1 arg1,type2 arg2,type3 arg3,type4 
arg4,type5 arg5,   \
 #if defined(__NR_futex_time64)
 # define __NR_sys_futex_time64 __NR_futex_time64
 #endif
-#define __NR_sys_inotify_init __NR_inotify_init
-#define __NR_sys_inotify_add_watch __NR_inotify_add_watch
-#define __NR_sys_inotify_rm_watch __NR_inotify_rm_watch
 #define __NR_sys_statx __NR_statx
 
 #if defined(__alpha__) || defined(__x86_64__) || defined(__s390x__)
@@ -447,33 +444,6 @@ static int sys_renameat2(int oldfd, const char *old,
 
 #ifdef CONFIG_INOTIFY
 #include 
-
-#if defined(TARGET_NR_inotify_init) && defined(__NR_inotify_init)
-static int sys_inotify_init(void)
-{
-  return (inotify_init());
-}
-#endif
-#if defined(TARGET_NR_inotify_add_watch) && defined(__NR_inotify_add_watch)
-static int sys_inotify_add_watch(int fd,const char *pathname, int32_t mask)
-{
-  return (inotify_add_watch(fd, pathname, mask));
-}
-#endif
-#if defined(TARGET_NR_inotify_rm_watch) && defined(__NR_inotify_rm_watch)
-static int sys_inotify_rm_watch(int fd, int32_t wd)
-{
-  return (inotify_rm_watch(fd, wd));
-}
-#endif
-#ifdef CONFIG_INOTIFY1
-#if defined(TARGET_NR_inotify_init1) && defined(__NR_inotify_init1)
-static int sys_inotify_init1(int flags)
-{
-  return (inotify_init1(flags));
-}
-#endif
-#endif
 #else
 /* Userspace can usually survive runtime without inotify */
 #undef TARGET_NR_inotify_init
@@ -12263,35 +12233,35 @@ static abi_long do_syscall1(void *cpu_env, int num, 
abi_long arg1,
 case TARGET_NR_futex_time64:
 return do_futex_time64(cpu, arg1, arg2, arg3, arg4, arg5, arg6);
 #endif
-#if defined(TARGET_NR_inotify_init) && defined(__NR_inotify_init)
+#ifdef CONFIG_INOTIFY
+#if defined(TARGET_NR_inotify_init)
 case TARGET_NR_inotify_init:
-ret = get_errno(sys_inotify_init());
+ret = get_errno(inotify_init());
 if (ret >= 0) {
 fd_trans_register(ret, _inotify_trans);
 }
 return ret;
 #endif
-#ifdef CONFIG_INOTIFY1
-#if defined(TARGET_NR_inotify_init1) && defined(__NR_inotify_init1)
+#if defined(TARGET_NR_inotify_init1) && defined(CONFIG_INOTIFY1)
 case TARGET_NR_inotify_init1:
-ret = get_errno(sys_inotify_init1(target_to_host_bitmask(arg1,
+ret = get_errno(inotify_init1(target_to_host_bitmask(arg1,
   fcntl_flags_tbl)));
 if (ret >= 0) {
 fd_trans_register(ret, _inotify_trans);
 }
 return ret;
 #endif
-#endif
-#if defined(TARGET_NR_inotify_add_watch) && defined(__NR_inotify_add_watch)
+#if defined(TARGET_NR_inotify_add_watch)
 case TARGET_NR_inotify_add_watch:
 p = lock_user_string(arg2);
-ret = get_errno(sys_inotify_add_watch(arg1, path(p), arg3));
+ret = get_errno(inotify_add_watch(arg1, path(p), arg3));
 unlock_user(p, arg2, 0);
 return ret;
 #endif
-#if defined(TARGET_NR_inotify_rm_watch) && defined(__NR_inotify_rm_watch)
+#if defined(TARGET_NR_inotify_rm_watch)
 case TARGET_NR_inotify_rm_watch:
-return get_errno(sys_inotify_rm_watch(arg1, arg2));
+return get_errno(inotify_rm_watch(arg1, arg2));
+#endif
 #endif
 
 #if defined(TARGET_NR_mq_open) && defined(__NR_mq_open)
-- 
2.34.1

Re: [PATCH 1/7] migration: Drop dead code of ram_debug_dump_bitmap()

2022-01-26 Thread Juan Quintela

Peter Xu  wrote:
> I planned to add "#ifdef DEBUG_POSTCOPY" around the function too because
> otherwise it'll be compiled into qemu binary even if it'll never be used.  
> Then
> I found that maybe it's easier to just drop it for good..
>
> Signed-off-by: Peter Xu 

Reviewed-by: Juan Quintela 

queued

Re: [PATCH] migration/ram: clean up unused comment.

2022-01-26 Thread Juan Quintela

Xu Zheng  wrote:
> Just a removal of an unused comment.
> a0a8aa147aa did many fixes and removed the parameter named "ms", but
> forget to remove the corresponding comment in function named
> "ram_save_host_page".
>
> Signed-off-by: Xu Zheng 
> Signed-off-by: Mao Zhongyi 

Reviewed-by: Juan Quintela 

queued.

> ---
>  migration/ram.c | 1 -
>  1 file changed, 1 deletion(-)
>
> diff --git a/migration/ram.c b/migration/ram.c
> index 57efa67f20..96944f0c70 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -2188,7 +2188,6 @@ static int ram_save_target_page(RAMState *rs, 
> PageSearchStatus *pss,

Somehow git diff got confised here, and shows ram_save_target_page where
it is ram_save_host_page().  I fixued it manually.


>   * Returns the number of pages written or negative on error
>   *
>   * @rs: current RAM state
> - * @ms: current migration state
>   * @pss: data about the page we want to send
>   * @last_stage: if we are at the completion stage
>   */

Re: [PATCH v2 3/3] migration: Perform vmsd structure check during tests

2022-01-26 Thread Juan Quintela

"Dr. David Alan Gilbert (git)"  wrote:
> From: "Dr. David Alan Gilbert" 
>
> Perform a check on vmsd structures during test runs in the hope
> of catching any missing terminators and other simple screwups.
>
> Signed-off-by: Dr. David Alan Gilbert 

Reviewed-by: Juan Quintela 

queued.

Re: [PATCH v2 1/3] ppc: Fix vmstate_pbr403 subsection name

2022-01-26 Thread Juan Quintela

"Dr. David Alan Gilbert (git)"  wrote:
> From: "Dr. David Alan Gilbert" 
>
> The pbr403 subsection is part of the tlb6xx state, so I believe it's
> name needs to be:
>
> .name = "cpu/tlb6xx/pbr403",
>
> Signed-off-by: Dr. David Alan Gilbert 

Dropped as code don't exist anymore.


> ---
>  target/ppc/machine.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/target/ppc/machine.c b/target/ppc/machine.c
> index 756d8de5d8..e535edb7c4 100644
> --- a/target/ppc/machine.c
> +++ b/target/ppc/machine.c
> @@ -718,7 +718,7 @@ static bool pbr403_needed(void *opaque)
>  }
>  
>  static const VMStateDescription vmstate_pbr403 = {
> -.name = "cpu/pbr403",
> +.name = "cpu/tlb6xx/pbr403",
>  .version_id = 1,
>  .minimum_version_id = 1,
>  .needed = pbr403_needed,

Re: [PATCH] migration/colo.c: Add missed return in error handling

2022-01-26 Thread Juan Quintela

Rao Lei  wrote:
> When doing failover and checkpoint, some returns are missed in error
> handling. Let's add it.
>
> Signed-off-by: Lei Rao 
> ---
>  migration/colo.c | 7 +--
>  1 file changed, 5 insertions(+), 2 deletions(-)
>
> diff --git a/migration/colo.c b/migration/colo.c
> index 5f7071b3cd..014d3cba01 100644
> --- a/migration/colo.c
> +++ b/migration/colo.c
> @@ -94,12 +94,15 @@ static void secondary_vm_do_failover(void)
>  if (local_err) {
>  error_report_err(local_err);
>  local_err = NULL;
> +return;

Assign a local variable before a return is a NOP, so remove the assignmenent?

>  }
>  
>  /* Notify all filters of all NIC to do checkpoint */
>  colo_notify_filters_event(COLO_EVENT_FAILOVER, _err);
>  if (local_err) {
>  error_report_err(local_err);
> +local_err = NULL;
> +return;

Same here.

>  }
>  
>  if (!autostart) {
> @@ -178,6 +181,7 @@ static void primary_vm_do_failover(void)
>  if (local_err) {
>  error_report_err(local_err);
>  local_err = NULL;
> +return;

And here.

>  }
>  
>  /* Notify COLO thread that failover work is finished */
> @@ -507,12 +511,11 @@ static int 
> colo_do_checkpoint_transaction(MigrationState *s,
>  goto out;
>  }
>  
> -ret = 0;
> -
>  qemu_mutex_lock_iothread();
>  vm_start();
>  qemu_mutex_unlock_iothread();
>  trace_colo_vm_state_change("stop", "run");
> +return 0;
>  
>  out:
>  if (local_err) {

This is really a NOP, but it is one line less, so I will not complain.

But ther, it is better to just rename the label from "out" to "error" or
something like that.

Later, Juan.

Re: [PATCH] migration: Report the error returned when save_live_iterate fails

2022-01-26 Thread Juan Quintela

David Edmondson  wrote:
> Should qemu_savevm_state_iterate() encounter a failure when calling a
> particular save_live_iterate function, report the error code returned
> by the function.
>
> Signed-off-by: David Edmondson 

Reviewed-by: Juan Quintela 

queued

[PATCH v2 2/2] ppc/pnv: use a do-while() loop in pnv_phb4_translate_tve()

2022-01-26 Thread Daniel Henrique Barboza

pnv_phb4_translate_tve() is quite similar to pnv_phb3_translate_tve(),
and that includes the fact that 'taddr' can be considered uninitialized
when throwing the "TCE access fault" error because, in theory, the loop
that sets 'taddr' can be skippable due to 'lev' being an signed int.

No one complained about this specific case yet, but since we took the
time to handle the same situtation in pnv_phb3_translate_tve(), let's
replicate it here as well.

Signed-off-by: Daniel Henrique Barboza 
---
 hw/pci-host/pnv_phb4.c | 12 ++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/hw/pci-host/pnv_phb4.c b/hw/pci-host/pnv_phb4.c
index a78add75b0..88a1479831 100644
--- a/hw/pci-host/pnv_phb4.c
+++ b/hw/pci-host/pnv_phb4.c
@@ -1261,13 +1261,21 @@ static void pnv_phb4_translate_tve(PnvPhb4DMASpace *ds, 
hwaddr addr,
 /* Top level table base address */
 base = tta << 12;
 
+/*
+ * There were reports of compilers complaining about 'taddr'
+ * being used uninitialized in pnv_phb3_translate_tve(), and
+ * the same scenario is happening here. Initialize 'taddr'
+ * just in case.
+ */
+taddr = base;
+
 /* Total shift to first level */
 sh = tbl_shift * lev + tce_shift;
 
 /* TODO: Limit to support IO page sizes */
 
 /* TODO: Multi-level untested */
-while ((lev--) >= 0) {
+do {
 /* Grab the TCE address */
 taddr = base | (((addr >> sh) & ((1ul << tbl_shift) - 1)) << 3);
 if (dma_memory_read(_space_memory, taddr, ,
@@ -1288,7 +1296,7 @@ static void pnv_phb4_translate_tve(PnvPhb4DMASpace *ds, 
hwaddr addr,
 }
 sh -= tbl_shift;
 base = tce & ~0xfffull;
-}
+} while ((lev--) >= 0);
 
 /* We exit the loop with TCE being the final TCE */
 tce_mask = ~((1ull << tce_shift) - 1);
-- 
2.34.1

[PATCH v2 1/2] ppc/pnv: use a do-while() loop in pnv_phb3_translate_tve()

2022-01-26 Thread Daniel Henrique Barboza

The 'taddr' variable is left unintialized, being set only inside the
"while ((lev--) >= 0)" loop where we get the TCE address. The 'lev' var
is an int32_t that is being initiliazed by the GETFIELD() macro, which
returns an uint64_t.

For a human reader this means that 'lev' will always be positive or zero.
But some compilers may beg to differ. 'lev' being an int32_t can in theory
be set as negative, and the "while ((lev--) >= 0)" loop might never be
reached, and 'taddr' will be left unitialized. This can cause phb3_error()
to use 'taddr' uninitialized down below:

if ((is_write & !(tce & 2)) || ((!is_write) && !(tce & 1))) {
phb3_error(phb, "TCE access fault at 0x%"PRIx64, taddr);

A quick way of fixing it is to use a do/while() loop. This will keep the
same semanting as the existing while() loop does and the compiler will
understand that 'taddr' will be initialized at least once.

Suggested-by: Matheus K. Ferst 
Resolves: https://gitlab.com/qemu-project/qemu/-/issues/573
Signed-off-by: Daniel Henrique Barboza 
---
 hw/pci-host/pnv_phb3.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/hw/pci-host/pnv_phb3.c b/hw/pci-host/pnv_phb3.c
index 7fb35dc031..39a6184419 100644
--- a/hw/pci-host/pnv_phb3.c
+++ b/hw/pci-host/pnv_phb3.c
@@ -792,7 +792,7 @@ static void pnv_phb3_translate_tve(PnvPhb3DMASpace *ds, 
hwaddr addr,
 sh = tbl_shift * lev + tce_shift;
 
 /* TODO: Multi-level untested */
-while ((lev--) >= 0) {
+do {
 /* Grab the TCE address */
 taddr = base | (((addr >> sh) & ((1ul << tbl_shift) - 1)) << 3);
 if (dma_memory_read(_space_memory, taddr, ,
@@ -813,7 +813,7 @@ static void pnv_phb3_translate_tve(PnvPhb3DMASpace *ds, 
hwaddr addr,
 }
 sh -= tbl_shift;
 base = tce & ~0xfffull;
-}
+} while ((lev--) >= 0);
 
 /* We exit the loop with TCE being the final TCE */
 tce_mask = ~((1ull << tce_shift) - 1);
-- 
2.34.1

[PATCH v2 0/2] use a do-while() loop in pnv_phbX_translate_tve()

2022-01-26 Thread Daniel Henrique Barboza

Hi,

This second version implemented Matheus' suggestion of using a
do-while() loop instead of initializing variables.

v1 link: https://lists.gnu.org/archive/html/qemu-devel/2022-01/msg05475.html

Daniel Henrique Barboza (2):
  ppc/pnv: use a do-while() loop in pnv_phb3_translate_tve()
  ppc/pnv: use a do-while() loop in pnv_phb4_translate_tve()

 hw/pci-host/pnv_phb3.c |  4 ++--
 hw/pci-host/pnv_phb4.c | 12 ++--
 2 files changed, 12 insertions(+), 4 deletions(-)

-- 
2.34.1

Re: [PATCH 1/2] ppc/pnv: initialize 'taddr' in pnv_phb3_translate_tve()

2022-01-26 Thread Daniel Henrique Barboza





On 1/26/22 14:28, Matheus K. Ferst wrote:

On 26/01/2022 10:41, Daniel Henrique Barboza wrote:

The 'taddr' variable is left unintialized, being set only inside the
"while ((lev--) >= 0)" loop where we get the TCE address. The 'lev' var
is an int32_t that is being initiliazed by the GETFIELD() macro, which
returns an uint64_t.

For a human reader this means that 'lev' will always be positive or zero.
But some compilers may beg to differ. 'lev' being an int32_t can in theory
be set as negative, and the "while ((lev--) >= 0)" loop might never be
reached, and 'taddr' will be left unitialized.


If we expect this code to execute at least once, wouldn't it be better to use a 
do-while? E.g.:

do {
     lev--;

     /* Grab the TCE address */
     taddr = base | (((addr >> sh) & ((1ul << tbl_shift) - 1)) << 3);
     if (dma_memory_read(_space_memory, taddr, ,
     /* ... */
     }
     sh -= tbl_shift;
     base = tce & ~0xfffull;
} while (lev >= 0);

Otherwise, I think we'll need to initialize tce too.


Initializing tce isn't necessary, at least as far as compiler warning goes,
because tce will be defaulted to zero and its current use (tce & 3, tce & 2,
tce & 1 operations) isn't offending the compiler. For now at least.

That said, I think using a do/while() loop is an idea that fixes the issue while
keeping the code flow, without having to add extra initializations, so I ended
up changing it as you suggested.


Thanks,


Daniel



Thanks,
Matheus K. Ferst
Instituto de Pesquisas ELDORADO 
Analista de Software
Aviso Legal - Disclaimer

Re: [PATCH v5 03/18] pci: isolated address space for PCI bus

2022-01-26 Thread Dr. David Alan Gilbert

* Stefan Hajnoczi (stefa...@redhat.com) wrote:
> On Wed, Jan 26, 2022 at 05:27:32AM +, Jag Raman wrote:
> > 
> > 
> > > On Jan 25, 2022, at 1:38 PM, Dr. David Alan Gilbert  
> > > wrote:
> > > 
> > > * Jag Raman (jag.ra...@oracle.com) wrote:
> > >> 
> > >> 
> > >>> On Jan 19, 2022, at 7:12 PM, Michael S. Tsirkin  wrote:
> > >>> 
> > >>> On Wed, Jan 19, 2022 at 04:41:52PM -0500, Jagannathan Raman wrote:
> >  Allow PCI buses to be part of isolated CPU address spaces. This has a
> >  niche usage.
> >  
> >  TYPE_REMOTE_MACHINE allows multiple VMs to house their PCI devices in
> >  the same machine/server. This would cause address space collision as
> >  well as be a security vulnerability. Having separate address spaces for
> >  each PCI bus would solve this problem.
> > >>> 
> > >>> Fascinating, but I am not sure I understand. any examples?
> > >> 
> > >> Hi Michael!
> > >> 
> > >> multiprocess QEMU and vfio-user implement a client-server model to allow
> > >> out-of-process emulation of devices. The client QEMU, which makes ioctls
> > >> to the kernel and runs VCPUs, could attach devices running in a server
> > >> QEMU. The server QEMU needs access to parts of the client’s RAM to
> > >> perform DMA.
> > > 
> > > Do you ever have the opposite problem? i.e. when an emulated PCI device
> > 
> > That’s an interesting question.
> > 
> > > exposes a chunk of RAM-like space (frame buffer, or maybe a mapped file)
> > > that the client can see.  What happens if two emulated devices need to
> > > access each others emulated address space?
> > 
> > In this case, the kernel driver would map the destination’s chunk of 
> > internal RAM into
> > the DMA space of the source device. Then the source device could write to 
> > that
> > mapped address range, and the IOMMU should direct those writes to the
> > destination device.
> > 
> > I would like to take a closer look at the IOMMU implementation on how to 
> > achieve
> > this, and get back to you. I think the IOMMU would handle this. Could you 
> > please
> > point me to the IOMMU implementation you have in mind?
> 
> I don't know if the current vfio-user client/server patches already
> implement device-to-device DMA, but the functionality is supported by
> the vfio-user protocol.
> 
> Basically: if the DMA regions lookup inside the vfio-user server fails,
> fall back to VFIO_USER_DMA_READ/WRITE messages instead.
> https://github.com/nutanix/libvfio-user/blob/master/docs/vfio-user.rst#vfio-user-dma-read
> 
> Here is the flow:
> 1. The vfio-user server with device A sends a DMA read to QEMU.
> 2. QEMU finds the MemoryRegion associated with the DMA address and sees
>it's a device.
>a. If it's emulated inside the QEMU process then the normal
>   device emulation code kicks in.
>b. If it's another vfio-user PCI device then the vfio-user PCI proxy
>   device forwards the DMA to the second vfio-user server's device B.

I'm starting to be curious if there's a way to persuade the guest kernel
to do it for us; in general is there a way to say to PCI devices that
they can only DMA to the host and not other PCI devices?  Or that the
address space of a given PCIe bus is non-overlapping with one of the
others?

Dave

> Stefan


-- 
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK

Re: [PATCH 0/8] target/ppc: powerpc_excp improvements [74xx] (5/n)

2022-01-26 Thread BALATON Zoltan


On Wed, 26 Jan 2022, Cédric Le Goater wrote:

On 1/26/22 18:55, BALATON Zoltan wrote:

On Wed, 26 Jan 2022, Fabiano Rosas wrote:

This handles the exception code for the 74xx family, i.e. 7400, 7410,
7440, 7445, 7450, 7455, 7457.

This is another family that is quite well known, so it should be
straight-forward as well.


I guess this is what may break VOF on pegasos2. Was Philippe's test case 
for this machine ever merged? (Although that may use the firmware ROM that 
was preferred as it tests more of the machine and may predate VOF so not 
sure it also tests with VOF.) The way to test it is this:


Get morphos demo ISO from https://www.morphos-team.net/morphos-3.15.iso
Extract boot.img from the root directory of the CD
Run QEMU as shown at http://zero.eik.bme.hu/~balaton/qemu/amiga/#morphos


I could never make it work :/


Philippe's test or the MorphOS iso? If the MorphOS boot, what problem do 
you get? I don't have or use avocado so can't comment on the former.


(For debugging maybe enabling vof traces would give more info but it was a 
while so I don't remember the details any more.)



Based on legoater/ppc-7.0


I could test when it's merged or when it applies on master but I don't 
usually test on branches. Did you verify it still works with pegasos2 or 
could you please make sure it won't break that use case?


I have a large set of images, here, that I use for non regression tests :

 https://github.com/legoater/qemu-ppc-boot

If we could add a pegasos, that would be nice.


I'm not sure how could I help with that. Once you can boot from command 
line it should be easy to add to the script as well.


Regards,
BALATON Zoltan

Re: [PATCH 3/3] migration/migration.c: Remove the MIGRATION_STATUS_ACTIVE when migration finished

2022-01-26 Thread Juan Quintela

Zhang Chen  wrote:
> The MIGRATION_STATUS_ACTIVE indicates that migration is running.
> Remove it to be handled by the default operation,
> It should be part of the unknown ending states.
>
> Signed-off-by: Zhang Chen 

Reviewed-by: Juan Quintela 

queued

Re: [PATCH v2 2/2] linux-user: sigprocmask check read perms first

2022-01-26 Thread Laurent Vivier


Le 26/01/2022 à 18:58, Patrick Venture a écrit :

Linux kernel now checks the read permissions before validating `how`

Suggested-by: Laurent Vivier 
Signed-off-by: Patrick Venture 


Reviewed-by: Laurent Vivier 


---
  linux-user/syscall.c | 12 +++-
  1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/linux-user/syscall.c b/linux-user/syscall.c
index 34bd819e38..210483d4e4 100644
--- a/linux-user/syscall.c
+++ b/linux-user/syscall.c
@@ -9465,6 +9465,13 @@ static abi_long do_syscall1(void *cpu_env, int num, 
abi_long arg1,
  int how;
  
  if (arg2) {

+p = lock_user(VERIFY_READ, arg2, sizeof(target_sigset_t), 1));
+if (!p) {
+return -TARGET_EFAULT;
+}
+target_to_host_old_sigset(, p);
+unlock_user(p, arg2, 0);
+set_ptr = 
  switch (arg1) {
  case TARGET_SIG_BLOCK:
  how = SIG_BLOCK;
@@ -9478,11 +9485,6 @@ static abi_long do_syscall1(void *cpu_env, int num, 
abi_long arg1,
  default:
  return -TARGET_EINVAL;
  }
-if (!(p = lock_user(VERIFY_READ, arg2, 
sizeof(target_sigset_t), 1)))
-return -TARGET_EFAULT;
-target_to_host_old_sigset(, p);
-unlock_user(p, arg2, 0);
-set_ptr = 
  } else {
  how = 0;
  set_ptr = NULL;

Re: [PATCH 2/3] migration/migration.c: Avoid COLO boot in postcopy migration

2022-01-26 Thread Juan Quintela

Zhang Chen  wrote:
> COLO dose not support postcopy migration and remove the Fixme.
>
> Signed-off-by: Zhang Chen 

Reviewed-by: Juan Quintela 

fixed the typo.

Re: [PATCH v2 1/2] linux-user: rt_sigprocmask, check read perms first

2022-01-26 Thread Laurent Vivier


Le 26/01/2022 à 18:58, Patrick Venture a écrit :

From: Shu-Chun Weng 

Linux kernel does it this way (checks read permission before validating `how`)
and the latest version of ABSL's `AddressIsReadable()` depends on this
behavior.

c.f.  
https://github.com/torvalds/linux/blob/9539ba4308ad5bdca6cb41c7b73cbb9f796dcdd7/kernel/signal.c#L3147
Reviewed-by: Patrick Venture 
Signed-off-by: Shu-Chun Weng 


Reviewed-by: Laurent Vivier 

but you must resend the patch: you are not the author, but you have to add your 
Signed-off-by.
(and now you can add my reviewed-by)

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/SubmittingPatches?id=f6f94e2ab1b33f0082ac22d71f66385a60d8157f#n296

Thanks,
Laurent


---
  linux-user/syscall.c | 12 +++-
  1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/linux-user/syscall.c b/linux-user/syscall.c
index 5950222a77..34bd819e38 100644
--- a/linux-user/syscall.c
+++ b/linux-user/syscall.c
@@ -9508,6 +9508,13 @@ static abi_long do_syscall1(void *cpu_env, int num, 
abi_long arg1,
  }
  
  if (arg2) {

+p = lock_user(VERIFY_READ, arg2, sizeof(target_sigset_t), 1);
+if (!p) {
+return -TARGET_EFAULT;
+}
+target_to_host_sigset(, p);
+unlock_user(p, arg2, 0);
+set_ptr = 
  switch(how) {
  case TARGET_SIG_BLOCK:
  how = SIG_BLOCK;
@@ -9521,11 +9528,6 @@ static abi_long do_syscall1(void *cpu_env, int num, 
abi_long arg1,
  default:
  return -TARGET_EINVAL;
  }
-if (!(p = lock_user(VERIFY_READ, arg2, 
sizeof(target_sigset_t), 1)))
-return -TARGET_EFAULT;
-target_to_host_sigset(, p);
-unlock_user(p, arg2, 0);
-set_ptr = 
  } else {
  how = 0;
  set_ptr = NULL;

Re: [PATCH 1/3] migration/migration.c: Add missed default error handler for migration state

2022-01-26 Thread Juan Quintela

Zhang Chen  wrote:
> In the migration_completion() no other status is expected, for
> example MIGRATION_STATUS_CANCELLING, MIGRATION_STATUS_CANCELLED, etc.
>
> Signed-off-by: Zhang Chen 

Reviewed-by: Juan Quintela 

queued

Re: [PATCH v7 05/10] hw/dma/xlnx_csu_dma: Support starting a read transfer through a class method

2022-01-26 Thread Luc Michel

On 16:11 Fri 21 Jan , Francisco Iglesias wrote:
> An option on real hardware when embedding a DMA engine into a peripheral
> is to make the peripheral control the engine through a custom DMA control
> (hardware) interface between the two. Software drivers in this scenario
> configure and trigger DMA operations through the controlling peripheral's
> register API (for example, writing a specific bit in a register could
> propagate down to a transfer start signal on the DMA control interface).
> At the same time the status, results and interrupts for the transfer might
> still be intended to be read and caught through the DMA engine's register
> API (and signals).
> 
> This patch adds a class 'read' method for allowing to start read transfers
> from peripherals embedding and controlling the Xilinx CSU DMA engine as in
> above scenario.
> 
> Signed-off-by: Francisco Iglesias 

Reviewed-by: Luc Michel 

> ---
>  include/hw/dma/xlnx_csu_dma.h | 19 +--
>  hw/dma/xlnx_csu_dma.c | 17 +
>  2 files changed, 34 insertions(+), 2 deletions(-)
> 
> diff --git a/include/hw/dma/xlnx_csu_dma.h b/include/hw/dma/xlnx_csu_dma.h
> index 28806628b1..922ab80eb6 100644
> --- a/include/hw/dma/xlnx_csu_dma.h
> +++ b/include/hw/dma/xlnx_csu_dma.h
> @@ -51,7 +51,22 @@ typedef struct XlnxCSUDMA {
>  RegisterInfo regs_info[XLNX_CSU_DMA_R_MAX];
>  } XlnxCSUDMA;
>  
> -#define XLNX_CSU_DMA(obj) \
> -OBJECT_CHECK(XlnxCSUDMA, (obj), TYPE_XLNX_CSU_DMA)
> +OBJECT_DECLARE_TYPE(XlnxCSUDMA, XlnxCSUDMAClass, XLNX_CSU_DMA)
> +
> +struct XlnxCSUDMAClass {
> +SysBusDeviceClass parent_class;
> +
> +/*
> + * read: Start a read transfer on a Xilinx CSU DMA engine
> + *
> + * @s: the Xilinx CSU DMA engine to start the transfer on
> + * @addr: the address to read
> + * @len: the number of bytes to read at 'addr'
> + *
> + * @return a MemTxResult indicating whether the operation succeeded 
> ('len'
> + * bytes were read) or failed.
> + */
> +MemTxResult (*read)(XlnxCSUDMA *s, hwaddr addr, uint32_t len);
> +};
>  
>  #endif
> diff --git a/hw/dma/xlnx_csu_dma.c b/hw/dma/xlnx_csu_dma.c
> index 896bb3574d..095f954476 100644
> --- a/hw/dma/xlnx_csu_dma.c
> +++ b/hw/dma/xlnx_csu_dma.c
> @@ -472,6 +472,20 @@ static uint64_t addr_msb_pre_write(RegisterInfo *reg, 
> uint64_t val)
>  return val & R_ADDR_MSB_ADDR_MSB_MASK;
>  }
>  
> +static MemTxResult xlnx_csu_dma_class_read(XlnxCSUDMA *s, hwaddr addr,
> +   uint32_t len)
> +{
> +RegisterInfo *reg = >regs_info[R_SIZE];
> +uint64_t we = MAKE_64BIT_MASK(0, 4 * 8);
> +
> +s->regs[R_ADDR] = addr;
> +s->regs[R_ADDR_MSB] = (uint64_t)addr >> 32;
> +
> +register_write(reg, len, we, object_get_typename(OBJECT(s)), false);
> +
> +return (s->regs[R_SIZE] == 0) ? MEMTX_OK : MEMTX_ERROR;
> +}
> +
>  static const RegisterAccessInfo *xlnx_csu_dma_regs_info[] = {
>  #define DMACH_REGINFO(NAME, snd) 
>  \
>  (const RegisterAccessInfo []) {  
>  \
> @@ -696,6 +710,7 @@ static void xlnx_csu_dma_class_init(ObjectClass *klass, 
> void *data)
>  {
>  DeviceClass *dc = DEVICE_CLASS(klass);
>  StreamSinkClass *ssc = STREAM_SINK_CLASS(klass);
> +XlnxCSUDMAClass *xcdc = XLNX_CSU_DMA_CLASS(klass);
>  
>  dc->reset = xlnx_csu_dma_reset;
>  dc->realize = xlnx_csu_dma_realize;
> @@ -704,6 +719,8 @@ static void xlnx_csu_dma_class_init(ObjectClass *klass, 
> void *data)
>  
>  ssc->push = xlnx_csu_dma_stream_push;
>  ssc->can_push = xlnx_csu_dma_stream_can_push;
> +
> +xcdc->read = xlnx_csu_dma_class_read;
>  }
>  
>  static void xlnx_csu_dma_init(Object *obj)
> -- 
> 2.11.0
> 

--

Notes on Generating Python signatures for QMP RPCs

2022-01-26 Thread John Snow

Hiya, I was experimenting with $subject and ran into a few points of
interest. This is basically an informal status report from me. I've
CC'd some of the usual suspects for people who care about SDKs and API
design and such.

This is just a list of some observations I had, so not everything
below is a question or an action item. Just sharing some notes.

(0) This experiment concerned generating signatures based on
introspection data, dynamically at runtime. In this environment type
hints are not required, as they are not actually used at runtime.
However, I added them anyway as an exercise for dynamic documentation
purposes. (i.e. `help(auto_generated_function)` showing type hints can
still be useful -- especially without access to QAPI doc blocks.)
Determining type information is also necessary for generating the
marshaling/unmarshaling functions to communicate with the server.

(1) QAPI types the return of many commands as an empty object. That's
literally indeed what happens on the wire, and it makes sense in that
if these commands were ever to return anything, it is a "compatible
evolution" to include new fields in such an object. In Python, this
does not make much sense, though; as this is somewhat hard to
annotate:

async def stop() -> Literal[{}]: ...

The more pythonic signature is:

async def stop() -> None: ...

I feel like it's spiritually equivalent, but I am aware it is a
distinct choice that is being made. It could theoretically interfere
with a choice made in QAPI later to explicitly return Null. I don't
think we'd do that, but it's still a choice of abstraction that
reduces the resolution of distinct return signatures.

(1.5) Do we have a formal definition for what we consider to be a
"compatible evolution" of the schema? I've got a fairly good idea, but
I am not sure how it's enforced. Is it just Markus being very
thorough? If we add more languages to the generator, we probably can't
burden Markus with knowing how to protect the compatibility of every
generator. We might need more assertions for invariants in the
generator itself ... but to protect "evolution", we need points of
reference to test against. Do we have anything for this? Do we need
one? Should I write a test?

(2) There are five commands that are exempted from returning an
object. qom-get is one. However, what I didn't really explicitly
realize is that this doesn't mean that only five commands don't return
an object -- we also actually allow for a list of objects, which
*many* commands use. There's no technical issue here, just an
observation. It is no problem at all to annotate Python commands as
"-> SomeReturnType" or "-> List[SomeDifferentReturnType]" or even "->
str:" as needed.

(3) Over the wire, the order of arguments to QMP commands is not
specified. In generating commands procedurally from introspection
data, I am made aware that there are several commands in which
"optional" arguments precede "required" arguments. This means that
function generation in Python cannot match the stated order 1:1.

That's not a crisis per se. For generating functions, we can use a
stable sort to bubble-up the required arguments, leaving the optional
ones trailing. However, it does mean that depending on how the QAPI
schema is modified in the future, the argument order may change
between versions of a generative SDK. I'd like to avoid that, if I
can.

One trick I have available to me in Python is the ability to stipulate
that all (QAPI) "optional" arguments are keyword-only. This means that
Optional parameters can be re-ordered arbitrarily without any breakage
in the generative python API. The only remaining concern is if the
*mandatory* arguments are re-ordered.

(In fact, I could stipulate that ALL arguments in Python bindings are
keyword-only, but I think that's going overboard and hurts usability
and readability.)

Marc-Andre has mentioned this before, but it might be nice to actually
specify a canonical ordering of arguments for languages that require
such things, and to make sure that we do not break this ordering
without good reason.

(Of course, SDK stability is not fully possible, and if this
functionality is desired, then it's time to use libvirt, hint hint
hint! However, we can avoid pointless churn in generated code and make
it easier to use and experiment with.)

(4) StrOrNull is a tricky design problem.

In Python, generally, omitted arguments are typed like this:
async def example_command(arg: Optional[int] = None) -> None: ...

Most Python programmers would recognize that signature as meaning that
they can omit 'arg' and some hopefully good default will be chosen.
However, in QAPI we do have the case where "omitted" is distinct from
"explicitly provided null". This is ... a bit challenging to convey
semantically. Python does not offer the ability to tell "what kind of
None" it received; i.e. unlike our generated QMP marshalling
functions, we do not have a "has_arg" boolean we can inspect.

So how do we write a function

Re: [PATCH v7 06/10] hw/ssi: Add a model of Xilinx Versal's OSPI flash memory controller

2022-01-26 Thread Luc Michel

Hi Francisco,

On 16:11 Fri 21 Jan , Francisco Iglesias wrote:
> Add a model of Xilinx Versal's OSPI flash memory controller.
> 
> Signed-off-by: Francisco Iglesias 
> ---
>  include/hw/ssi/xlnx-versal-ospi.h |  111 +++
>  hw/ssi/xlnx-versal-ospi.c | 1853 
> +
>  hw/ssi/meson.build|1 +
>  3 files changed, 1965 insertions(+)
>  create mode 100644 include/hw/ssi/xlnx-versal-ospi.h
>  create mode 100644 hw/ssi/xlnx-versal-ospi.c
> 
> diff --git a/include/hw/ssi/xlnx-versal-ospi.h 
> b/include/hw/ssi/xlnx-versal-ospi.h
> new file mode 100644
> index 00..14d1263497
> --- /dev/null
> +++ b/include/hw/ssi/xlnx-versal-ospi.h
> @@ -0,0 +1,111 @@

[snip]

> +static void ospi_ahb_decoder_enable_cs(XlnxVersalOspi *s, hwaddr addr)
> +{
> +int cs = ospi_ahb_decoder_cs(s, addr);
> +
> +if (cs >= 0) {
> +for (int i = 0; i < s->num_cs; i++) {
> +qemu_set_irq(s->cs_lines[i], cs != i);
If you re-send a series, indent is off here.

Otherwise:

Reviewed-by: Luc Michel

Re: [PATCH v4 06/42] hw/cxl/device: Implement basic mailbox (8.2.8.4)

2022-01-26 Thread Alex Bennée



Jonathan Cameron  writes:

> From: Ben Widawsky 
>
> This is the beginning of implementing mailbox support for CXL 2.0
> devices. The implementation recognizes when the doorbell is rung,
> handles the command/payload, clears the doorbell while returning error
> codes and data.
>
> Generally the mailbox mechanism is designed to permit communication
> between the host OS and the firmware running on the device. For our
> purposes, we emulate both the firmware, implemented primarily in
> cxl-mailbox-utils.c, and the hardware.
>
> No commands are implemented yet.
>
> Signed-off-by: Ben Widawsky 
> Signed-off-by: Jonathan Cameron 
> ---
>  hw/cxl/cxl-device-utils.c   | 129 ++-
>  hw/cxl/cxl-mailbox-utils.c  | 201 
>  hw/cxl/meson.build  |   1 +
>  include/hw/cxl/cxl.h|   3 +
>  include/hw/cxl/cxl_device.h |  29 +-
>  5 files changed, 358 insertions(+), 5 deletions(-)
>
> diff --git a/hw/cxl/cxl-device-utils.c b/hw/cxl/cxl-device-utils.c
> index cb1b0a8217..c9ada7ee94 100644
> --- a/hw/cxl/cxl-device-utils.c
> +++ b/hw/cxl/cxl-device-utils.c
> @@ -40,6 +40,115 @@ static uint64_t dev_reg_read(void *opaque, hwaddr offset, 
> unsigned size)
>  return 0;
>  }
>  
> +static uint64_t mailbox_reg_read(void *opaque, hwaddr offset, unsigned size)
> +{
> +CXLDeviceState *cxl_dstate = opaque;
> +
> +switch (size) {
> +case 1:
> +return cxl_dstate->mbox_reg_state[offset];
> +case 2:
> +return cxl_dstate->mbox_reg_state16[offset / 2];
> +case 4:
> +return cxl_dstate->mbox_reg_state32[offset / 4];
> +case 8:
> +return cxl_dstate->mbox_reg_state64[offset / 8];
> +default:
> +g_assert_not_reached();
> +}
> +}
> +
> +static void mailbox_mem_writel(uint32_t *reg_state, hwaddr offset,
> +   uint64_t value)
> +{
> +switch (offset) {
> +case A_CXL_DEV_MAILBOX_CTRL:
> +/* fallthrough */
> +case A_CXL_DEV_MAILBOX_CAP:
> +/* RO register */
> +break;
> +default:
> +qemu_log_mask(LOG_UNIMP,
> +  "%s Unexpected 32-bit access to 0x%" PRIx64 " (WI)\n",
> +  __func__, offset);
> +break;

The log message implies WI but it isn't. Did you want an early return?

> +}
> +
> +reg_state[offset / 4] = value;
> +}
> +
> +static void mailbox_mem_writeq(uint64_t *reg_state, hwaddr offset,
> +   uint64_t value)
> +{
> +switch (offset) {
> +case A_CXL_DEV_MAILBOX_CMD:
> +break;
> +case A_CXL_DEV_BG_CMD_STS:
> +/* BG not supported */
> +/* fallthrough */
> +case A_CXL_DEV_MAILBOX_STS:
> +/* Read only register, will get updated by the state machine */
> +return;
> +default:
> +qemu_log_mask(LOG_UNIMP,
> +  "%s Unexpected 64-bit access to 0x%" PRIx64 " (WI)\n",
> +  __func__, offset);
> +return;

which is the behaviour here.

> +}
> +
> +
> +reg_state[offset / 8] = value;
> +}
> +
> +static void mailbox_reg_write(void *opaque, hwaddr offset, uint64_t value,
> +  unsigned size)
> +{
> +CXLDeviceState *cxl_dstate = opaque;
> +
> +if (offset >= A_CXL_DEV_CMD_PAYLOAD) {
> +memcpy(cxl_dstate->mbox_reg_state + offset, , size);
> +return;
> +}
> +
> +/*
> + * Lock is needed to prevent concurrent writes as well as to prevent 
> writes
> + * coming in while the firmware is processing. Without background 
> commands
> + * or the second mailbox implemented, this serves no purpose since the
> + * memory access is synchronized at a higher level (per memory region).
> + */
> +RCU_READ_LOCK_GUARD();

RCU_READ_LOCK doesn't prevent concurrent writes, although the BQL should.

> +
> +switch (size) {
> +case 4:
> +mailbox_mem_writel(cxl_dstate->mbox_reg_state32, offset, value);
> +break;
> +case 8:
> +mailbox_mem_writeq(cxl_dstate->mbox_reg_state64, offset, value);
> +break;
> +default:
> +g_assert_not_reached();
> +}
> +
> +if (ARRAY_FIELD_EX32(cxl_dstate->mbox_reg_state32, CXL_DEV_MAILBOX_CTRL,
> + DOORBELL))
> +cxl_process_mailbox(cxl_dstate);
> +}
> +
> +static const MemoryRegionOps mailbox_ops = {
> +.read = mailbox_reg_read,
> +.write = mailbox_reg_write,
> +.endianness = DEVICE_LITTLE_ENDIAN,
> +.valid = {
> +.min_access_size = 1,
> +.max_access_size = 8,
> +.unaligned = false,
> +},
> +.impl = {
> +.min_access_size = 1,
> +.max_access_size = 8,
> +},

with_attrs?

> +};
> +
>  static const MemoryRegionOps dev_ops = {
>  .read = dev_reg_read,
>  .write = NULL, /* status register is read only */
> @@ -80,20 +189,33 @@ void cxl_device_register_block_init(Object *obj, 
> CXLDeviceState

Re: [PATCH v4 01/15] pcie: Add support for Single Root I/O Virtualization (SR/IOV)

2022-01-26 Thread Knut Omang

On Wed, 2022-01-26 at 18:11 +0100, Lukasz Maniak wrote:
> From: Knut Omang 
> 
> This patch provides the building blocks for creating an SR/IOV
> PCIe Extended Capability header and register/unregister
> SR/IOV Virtual Functions.
> 
> Signed-off-by: Knut Omang 
> ---
>  hw/pci/meson.build  |   1 +
>  hw/pci/pci.c    | 100 +---
>  hw/pci/pcie.c   |   5 +
>  hw/pci/pcie_sriov.c | 294 
>  hw/pci/trace-events |   5 +
>  include/hw/pci/pci.h    |  12 +-
>  include/hw/pci/pcie.h   |   6 +
>  include/hw/pci/pcie_sriov.h |  71 +
>  include/qemu/typedefs.h |   2 +
>  9 files changed, 470 insertions(+), 26 deletions(-)
>  create mode 100644 hw/pci/pcie_sriov.c
>  create mode 100644 include/hw/pci/pcie_sriov.h
> 
> diff --git a/hw/pci/meson.build b/hw/pci/meson.build
> index 5c4bbac817..bcc9c75919 100644
> --- a/hw/pci/meson.build
> +++ b/hw/pci/meson.build
> @@ -5,6 +5,7 @@ pci_ss.add(files(
>    'pci.c',
>    'pci_bridge.c',
>    'pci_host.c',
> +  'pcie_sriov.c',
>    'shpc.c',
>    'slotid_cap.c'
>  ))
> diff --git a/hw/pci/pci.c b/hw/pci/pci.c
> index 5d30f9ca60..ba8fb92efc 100644
> --- a/hw/pci/pci.c
> +++ b/hw/pci/pci.c
> @@ -239,6 +239,9 @@ int pci_bar(PCIDevice *d, int reg)
>  {
>  uint8_t type;
>  
> +    /* PCIe virtual functions do not have their own BARs */
> +    assert(!pci_is_vf(d));
> +
>  if (reg != PCI_ROM_SLOT)
>  return PCI_BASE_ADDRESS_0 + reg * 4;
>  
> @@ -304,10 +307,30 @@ void pci_device_deassert_intx(PCIDevice *dev)
>  }
>  }
>  
> -static void pci_do_device_reset(PCIDevice *dev)
> +static void pci_reset_regions(PCIDevice *dev)
>  {
>  int r;
> +    if (pci_is_vf(dev)) {
> +    return;
> +    }
> +
> +    for (r = 0; r < PCI_NUM_REGIONS; ++r) {
> +    PCIIORegion *region = >io_regions[r];
> +    if (!region->size) {
> +    continue;
> +    }
>  
> +    if (!(region->type & PCI_BASE_ADDRESS_SPACE_IO) &&
> +    region->type & PCI_BASE_ADDRESS_MEM_TYPE_64) {
> +    pci_set_quad(dev->config + pci_bar(dev, r), region->type);
> +    } else {
> +    pci_set_long(dev->config + pci_bar(dev, r), region->type);
> +    }
> +    }
> +}
> +
> +static void pci_do_device_reset(PCIDevice *dev)
> +{
>  pci_device_deassert_intx(dev);
>  assert(dev->irq_state == 0);
>  
> @@ -323,19 +346,7 @@ static void pci_do_device_reset(PCIDevice *dev)
>    pci_get_word(dev->wmask + PCI_INTERRUPT_LINE) |
>    pci_get_word(dev->w1cmask + 
> PCI_INTERRUPT_LINE));
>  dev->config[PCI_CACHE_LINE_SIZE] = 0x0;
> -    for (r = 0; r < PCI_NUM_REGIONS; ++r) {
> -    PCIIORegion *region = >io_regions[r];
> -    if (!region->size) {
> -    continue;
> -    }
> -
> -    if (!(region->type & PCI_BASE_ADDRESS_SPACE_IO) &&
> -    region->type & PCI_BASE_ADDRESS_MEM_TYPE_64) {
> -    pci_set_quad(dev->config + pci_bar(dev, r), region->type);
> -    } else {
> -    pci_set_long(dev->config + pci_bar(dev, r), region->type);
> -    }
> -    }
> +    pci_reset_regions(dev);
>  pci_update_mappings(dev);
>  
>  msi_reset(dev);
> @@ -884,6 +895,16 @@ static void pci_init_multifunction(PCIBus *bus, 
> PCIDevice *dev,
> Error **errp)
>  dev->config[PCI_HEADER_TYPE] |= PCI_HEADER_TYPE_MULTI_FUNCTION;
>  }
>  
> +    /*
> + * With SR/IOV and ARI, a device at function 0 need not be a 
> multifunction
> + * device, as it may just be a VF that ended up with function 0 in
> + * the legacy PCI interpretation. Avoid failing in such cases:
> + */
> +    if (pci_is_vf(dev) &&
> +    dev->exp.sriov_vf.pf->cap_present & QEMU_PCI_CAP_MULTIFUNCTION) {
> +    return;
> +    }
> +
>  /*
>   * multifunction bit is interpreted in two ways as follows.
>   *   - all functions must set the bit to 1.
> @@ -1083,6 +1104,7 @@ static PCIDevice *do_pci_register_device(PCIDevice 
> *pci_dev,
>     bus->devices[devfn]->name);
>  return NULL;
>  } else if (dev->hotplugged &&
> +   !pci_is_vf(pci_dev) &&
>     pci_get_function_0(pci_dev)) {
>  error_setg(errp, "PCI: slot %d function 0 already occupied by %s,"
>     " new func %s cannot be exposed to guest.",
> @@ -1191,6 +1213,7 @@ void pci_register_bar(PCIDevice *pci_dev, int 
> region_num,
>  pcibus_t size = memory_region_size(memory);
>  uint8_t hdr_type;
>  
> +    assert(!pci_is_vf(pci_dev)); /* VFs must use pcie_sriov_vf_register_bar 
> */
>  assert(region_num >= 0);
>  assert(region_num < PCI_NUM_REGIONS);
>  assert(is_power_of_2(size));
> @@ -1294,11 +1317,45 @@ pcibus_t pci_get_bar_addr(PCIDevice *pci_dev, int 
> region_num)
>  return pci_dev->io_regions[region_num].addr;
>  }
>  
> -static pcibus_t pci_bar_address(PCIDevice *d,
> -

Re: [PATCH 0/8] target/ppc: powerpc_excp improvements [74xx] (5/n)

2022-01-26 Thread Cédric Le Goater


On 1/26/22 18:55, BALATON Zoltan wrote:

On Wed, 26 Jan 2022, Fabiano Rosas wrote:

This handles the exception code for the 74xx family, i.e. 7400, 7410,
7440, 7445, 7450, 7455, 7457.

This is another family that is quite well known, so it should be
straight-forward as well.


I guess this is what may break VOF on pegasos2. Was Philippe's test case for 
this machine ever merged? (Although that may use the firmware ROM that was 
preferred as it tests more of the machine and may predate VOF so not sure it 
also tests with VOF.) The way to test it is this:

Get morphos demo ISO from https://www.morphos-team.net/morphos-3.15.iso
Extract boot.img from the root directory of the CD
Run QEMU as shown at http://zero.eik.bme.hu/~balaton/qemu/amiga/#morphos


I could never make it work :/



(For debugging maybe enabling vof traces would give more info but it was a 
while so I don't remember the details any more.)


Based on legoater/ppc-7.0


I could test when it's merged or when it applies on master but I don't usually 
test on branches. Did you verify it still works with pegasos2 or could you 
please make sure it won't break that use case?


I have a large set of images, here, that I use for non regression tests :

  https://github.com/legoater/qemu-ppc-boot

If we could add a pegasos, that would be nice.

Thanks,

C.

Re: [PATCH v4 04/42] hw/cxl/device: Introduce a CXL device (8.2.8)

2022-01-26 Thread Alex Bennée



Jonathan Cameron  writes:

> From: Ben Widawsky 
>
> A CXL device is a type of CXL component. Conceptually, a CXL device
> would be a leaf node in a CXL topology. From an emulation perspective,
> CXL devices are the most complex and so the actual implementation is
> reserved for discrete commits.
>
> This new device type is specifically catered towards the eventual
> implementation of a Type3 CXL.mem device, 8.2.8.5 in the CXL 2.0
> specification.
>
> Signed-off-by: Ben Widawsky 
> Signed-off-by: Jonathan Cameron 
> ---
>  include/hw/cxl/cxl.h|   1 +
>  include/hw/cxl/cxl_device.h | 157 
>  2 files changed, 158 insertions(+)
>
> diff --git a/include/hw/cxl/cxl.h b/include/hw/cxl/cxl.h
> index 8c738c7a2b..b9d1ac3fad 100644
> --- a/include/hw/cxl/cxl.h
> +++ b/include/hw/cxl/cxl.h
> @@ -12,5 +12,6 @@
>  
>  #include "cxl_pci.h"
>  #include "cxl_component.h"
> +#include "cxl_device.h"
>  
>  #endif
> diff --git a/include/hw/cxl/cxl_device.h b/include/hw/cxl/cxl_device.h
> new file mode 100644
> index 00..3b6ed745f0
> --- /dev/null
> +++ b/include/hw/cxl/cxl_device.h
> @@ -0,0 +1,157 @@
> +/*
> + * QEMU CXL Devices
> + *
> + * Copyright (c) 2020 Intel
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2. See the
> + * COPYING file in the top-level directory.
> + */
> +
> +#ifndef CXL_DEVICE_H
> +#define CXL_DEVICE_H
> +
> +#include "hw/register.h"
> +
> +/*
> + * The following is how a CXL device's MMIO space is laid out. The only
> + * requirement from the spec is that the capabilities array and the 
> capability
> + * headers start at offset 0 and are contiguously packed. The headers 
> themselves
> + * provide offsets to the register fields. For this emulation, registers will
> + * start at offset 0x80 (m == 0x80). No secondary mailbox is implemented 
> which
> + * means that n = m + sizeof(mailbox registers) + sizeof(device registers).
> + *
> + * This is roughly described in 8.2.8 Figure 138 of the CXL 2.0 spec.
> + *
> + *   +-+
> + *   | |
> + *   |Memory Device Registers  |
> + *   | |
> + * n + PAYLOAD_SIZE_MAX  ---
> + *  ^| |
> + *  || |
> + *  || |
> + *  || |
> + *  || |
> + *  || Mailbox Payload |
> + *  || |
> + *  || |
> + *  || |
> + *  |---
> + *  ||   Mailbox Registers |
> + *  || |
> + *  n---
> + *  ^| |
> + *  ||Device Registers |
> + *  || |
> + *  m-->
> + *  ^|  Memory Device Capability Header|
> + *  |---
> + *  || Mailbox Capability Header   |
> + *  |-- 
> + *  || Device Capability Header|
> + *  |---
> + *  || |
> + *  || |
> + *  ||  Device Cap Array[0..n] |
> + *  || |
> + *  || |
> + *   | |
> + *  0+-+
> + *
> + */

Excellent diagram ;-)

> +
> +#define CXL_DEVICE_CAP_HDR1_OFFSET 0x10 /* Figure 138 */
> +#define CXL_DEVICE_CAP_REG_SIZE 0x10 /* 8.2.8.2 */
> +#define CXL_DEVICE_CAPS_MAX 4 /* 8.2.8.2.1 + 8.2.8.5 */
> +
> +#define CXL_DEVICE_REGISTERS_OFFSET 0x80 /* Read comment above */
> +#define CXL_DEVICE_REGISTERS_LENGTH 0x8 /* 8.2.8.3.1 */
> +
> +#define CXL_MAILBOX_REGISTERS_OFFSET \
> +(CXL_DEVICE_REGISTERS_OFFSET + CXL_DEVICE_REGISTERS_LENGTH)
> +#define CXL_MAILBOX_REGISTERS_SIZE 0x20 /* 8.2.8.4, Figure 139 */
> +#define CXL_MAILBOX_PAYLOAD_SHIFT 11
> +#define CXL_MAILBOX_MAX_PAYLOAD_SIZE (1 << CXL_MAILBOX_PAYLOAD_SHIFT)
> +#define CXL_MAILBOX_REGISTERS_LENGTH \
> +(CXL_MAILBOX_REGISTERS_SIZE +

Re: [PATCH 0/2] Adds designware i2c module and adds it to virt arm

2022-01-26 Thread Peter Maydell

On Wed, 26 Jan 2022 at 17:12, Chris Rauer  wrote:
>
>> I need to see a pretty strong justification for why we should be
>> adding new kinds of devices to the virt machine,
>
> The designware i2c controller is a very common controller on many
>  ARM SoCs.  It has device tree bindings and ACPI bindings which
> makes it ideal for this platform.

No, I mean, why do we need an i2c controller on the virt
board at all ?

> >Forgot to mention, but my prefered approach for providing
> >an i2c controller on the virt board would be to have a
> >PCI i2c controller: that way users who do need it can plug it
> >in with a -device command line option, and users who don't
> >need it never have to worry about it.

> > (We seem to have an ICH9-SMB PCI device already; I have no idea if it's 
> > suitable.)
> I didn't find that device suitable because it is part of the Intel
> Southbridge, which may have some Intel platform quirks, and
> we don't need all the things in that IO hub.

That's a pity. Is there a different PCI I2C controller we could model ?

thanks
-- PMM

Re: [PATCH v4 05/42] hw/cxl/device: Implement the CAP array (8.2.8.1-2)

2022-01-26 Thread Alex Bennée



Jonathan Cameron  writes:

> From: Ben Widawsky 
>
> This implements all device MMIO up to the first capability. That
> includes the CXL Device Capabilities Array Register, as well as all of
> the CXL Device Capability Header Registers. The latter are filled in as
> they are implemented in the following patches.
>
> Endianness and alignment are managed by softmmu memory core.
>
> Signed-off-by: Ben Widawsky 
> Signed-off-by: Jonathan Cameron 
> ---
>  hw/cxl/cxl-device-utils.c   | 105 
>  hw/cxl/meson.build  |   1 +
>  include/hw/cxl/cxl_device.h |  28 +-
>  3 files changed, 133 insertions(+), 1 deletion(-)
>
> diff --git a/hw/cxl/cxl-device-utils.c b/hw/cxl/cxl-device-utils.c
> new file mode 100644
> index 00..cb1b0a8217
> --- /dev/null
> +++ b/hw/cxl/cxl-device-utils.c
> @@ -0,0 +1,105 @@
> +/*
> + * CXL Utility library for devices
> + *
> + * Copyright(C) 2020 Intel Corporation.
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2. See the
> + * COPYING file in the top-level directory.
> + */
> +
> +#include "qemu/osdep.h"
> +#include "qemu/log.h"
> +#include "hw/cxl/cxl.h"
> +
> +/*
> + * Device registers have no restrictions per the spec, and so fall back to 
> the
> + * default memory mapped register rules in 8.2:
> + *   Software shall use CXL.io Memory Read and Write to access memory mapped
> + *   register defined in this section. Unless otherwise specified, software
> + *   shall restrict the accesses width based on the following:
> + *   • A 32 bit register shall be accessed as a 1 Byte, 2 Bytes or 4 Bytes
> + * quantity.
> + *   • A 64 bit register shall be accessed as a 1 Byte, 2 Bytes, 4 Bytes or 8
> + * Bytes
> + *   • The address shall be a multiple of the access width, e.g. when
> + * accessing a register as a 4 Byte quantity, the address shall be
> + * multiple of 4.
> + *   • The accesses shall map to contiguous bytes.If these rules are not
> + * followed, the behavior is undefined
> + */
> +
> +static uint64_t caps_reg_read(void *opaque, hwaddr offset, unsigned size)
> +{
> +CXLDeviceState *cxl_dstate = opaque;
> +
> +return cxl_dstate->caps_reg_state32[offset / 4];
> +}
> +
> +static uint64_t dev_reg_read(void *opaque, hwaddr offset, unsigned size)
> +{
> +return 0;
> +}
> +
> +static const MemoryRegionOps dev_ops = {
> +.read = dev_reg_read,
> +.write = NULL, /* status register is read only */
> +.endianness = DEVICE_LITTLE_ENDIAN,
> +.valid = {
> +.min_access_size = 1,
> +.max_access_size = 8,
> +.unaligned = false,
> +},
> +.impl = {
> +.min_access_size = 1,
> +.max_access_size = 8,
> +},
> +};

I think for >64 bit registers you need to use the read_with_attrs 

> +
> +static const MemoryRegionOps caps_ops = {
> +.read = caps_reg_read,
> +.write = NULL, /* caps registers are read only */
> +.endianness = DEVICE_LITTLE_ENDIAN,
> +.valid = {
> +.min_access_size = 1,
> +.max_access_size = 8,
> +.unaligned = false,
> +},
> +.impl = {
> +.min_access_size = 4,
> +.max_access_size = 4,
> +},
> +};
> +
> +void cxl_device_register_block_init(Object *obj, CXLDeviceState *cxl_dstate)
> +{
> +/* This will be a BAR, so needs to be rounded up to pow2 for PCI spec */
> +memory_region_init(_dstate->device_registers, obj, 
> "device-registers",
> +   pow2ceil(CXL_MMIO_SIZE));
> +
> +memory_region_init_io(_dstate->caps, obj, _ops, cxl_dstate,
> +  "cap-array", CXL_CAPS_SIZE);
> +memory_region_init_io(_dstate->device, obj, _ops, cxl_dstate,
> +  "device-status", CXL_DEVICE_REGISTERS_LENGTH);
> +
> +memory_region_add_subregion(_dstate->device_registers, 0,
> +_dstate->caps);
> +memory_region_add_subregion(_dstate->device_registers,
> +CXL_DEVICE_REGISTERS_OFFSET,
> +_dstate->device);
> +}
> +
> +static void device_reg_init_common(CXLDeviceState *cxl_dstate) { }
> +
> +void cxl_device_register_init_common(CXLDeviceState *cxl_dstate)
> +{
> +uint32_t *cap_hdrs = cxl_dstate->caps_reg_state32;
> +const int cap_count = 1;
> +
> +/* CXL Device Capabilities Array Register */
> +ARRAY_FIELD_DP32(cap_hdrs, CXL_DEV_CAP_ARRAY, CAP_ID, 0);
> +ARRAY_FIELD_DP32(cap_hdrs, CXL_DEV_CAP_ARRAY, CAP_VERSION, 1);
> +ARRAY_FIELD_DP32(cap_hdrs, CXL_DEV_CAP_ARRAY2, CAP_COUNT, cap_count);
> +
> +cxl_device_cap_init(cxl_dstate, DEVICE, 1);
> +device_reg_init_common(cxl_dstate);
> +}
> diff --git a/hw/cxl/meson.build b/hw/cxl/meson.build
> index 00c3876a0f..47154d6850 100644
> --- a/hw/cxl/meson.build
> +++ b/hw/cxl/meson.build
> @@ -1,3 +1,4 @@
>  softmmu_ss.add(when: 'CONFIG_CXL', if_true: files(
>'cxl-component-utils.c',
> +  'cxl-device-utils.c',
>  ))
>

Re: [PATCH v5 03/18] pci: isolated address space for PCI bus

2022-01-26 Thread Dr. David Alan Gilbert

* Jag Raman (jag.ra...@oracle.com) wrote:
> 
> 
> > On Jan 25, 2022, at 1:38 PM, Dr. David Alan Gilbert  
> > wrote:
> > 
> > * Jag Raman (jag.ra...@oracle.com) wrote:
> >> 
> >> 
> >>> On Jan 19, 2022, at 7:12 PM, Michael S. Tsirkin  wrote:
> >>> 
> >>> On Wed, Jan 19, 2022 at 04:41:52PM -0500, Jagannathan Raman wrote:
>  Allow PCI buses to be part of isolated CPU address spaces. This has a
>  niche usage.
>  
>  TYPE_REMOTE_MACHINE allows multiple VMs to house their PCI devices in
>  the same machine/server. This would cause address space collision as
>  well as be a security vulnerability. Having separate address spaces for
>  each PCI bus would solve this problem.
> >>> 
> >>> Fascinating, but I am not sure I understand. any examples?
> >> 
> >> Hi Michael!
> >> 
> >> multiprocess QEMU and vfio-user implement a client-server model to allow
> >> out-of-process emulation of devices. The client QEMU, which makes ioctls
> >> to the kernel and runs VCPUs, could attach devices running in a server
> >> QEMU. The server QEMU needs access to parts of the client’s RAM to
> >> perform DMA.
> > 
> > Do you ever have the opposite problem? i.e. when an emulated PCI device
> 
> That’s an interesting question.
> 
> > exposes a chunk of RAM-like space (frame buffer, or maybe a mapped file)
> > that the client can see.  What happens if two emulated devices need to
> > access each others emulated address space?
> 
> In this case, the kernel driver would map the destination’s chunk of internal 
> RAM into
> the DMA space of the source device. Then the source device could write to that
> mapped address range, and the IOMMU should direct those writes to the
> destination device.

Are all devices mappable like that?

> I would like to take a closer look at the IOMMU implementation on how to 
> achieve
> this, and get back to you. I think the IOMMU would handle this. Could you 
> please
> point me to the IOMMU implementation you have in mind?

I didn't have one in mind; I was just hitting a similar problem on
Virtiofs DAX.

Dave

> Thank you!
> --
> Jag
> 
> > 
> > Dave
> > 
> >> In the case where multiple clients attach devices that are running on the
> >> same server, we need to ensure that each devices has isolated memory
> >> ranges. This ensures that the memory space of one device is not visible
> >> to other devices in the same server.
> >> 
> >>> 
> >>> I also wonder whether this special type could be modelled like a special
> >>> kind of iommu internally.
> >> 
> >> Could you please provide some more details on the design?
> >> 
> >>> 
>  Signed-off-by: Elena Ufimtseva 
>  Signed-off-by: John G Johnson 
>  Signed-off-by: Jagannathan Raman 
>  ---
>  include/hw/pci/pci.h |  2 ++
>  include/hw/pci/pci_bus.h | 17 +
>  hw/pci/pci.c | 17 +
>  hw/pci/pci_bridge.c  |  5 +
>  4 files changed, 41 insertions(+)
>  
>  diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
>  index 023abc0f79..9bb4472abc 100644
>  --- a/include/hw/pci/pci.h
>  +++ b/include/hw/pci/pci.h
>  @@ -387,6 +387,8 @@ void pci_device_save(PCIDevice *s, QEMUFile *f);
>  int pci_device_load(PCIDevice *s, QEMUFile *f);
>  MemoryRegion *pci_address_space(PCIDevice *dev);
>  MemoryRegion *pci_address_space_io(PCIDevice *dev);
>  +AddressSpace *pci_isol_as_mem(PCIDevice *dev);
>  +AddressSpace *pci_isol_as_io(PCIDevice *dev);
>  
>  /*
>  * Should not normally be used by devices. For use by sPAPR target
>  diff --git a/include/hw/pci/pci_bus.h b/include/hw/pci/pci_bus.h
>  index 347440d42c..d78258e79e 100644
>  --- a/include/hw/pci/pci_bus.h
>  +++ b/include/hw/pci/pci_bus.h
>  @@ -39,9 +39,26 @@ struct PCIBus {
> void *irq_opaque;
> PCIDevice *devices[PCI_SLOT_MAX * PCI_FUNC_MAX];
> PCIDevice *parent_dev;
>  +
> MemoryRegion *address_space_mem;
> MemoryRegion *address_space_io;
>  
>  +/**
>  + * Isolated address spaces - these allow the PCI bus to be part
>  + * of an isolated address space as opposed to the global
>  + * address_space_memory & address_space_io.
> >>> 
> >>> Are you sure address_space_memory & address_space_io are
> >>> always global? even in the case of an iommu?
> >> 
> >> On the CPU side of the Root Complex, I believe address_space_memory
> >> & address_space_io are global.
> >> 
> >> In the vfio-user case, devices on the same machine (TYPE_REMOTE_MACHINE)
> >> could be attached to different clients VMs. Each client would have their 
> >> own address
> >> space for their CPUs. With isolated address spaces, we ensure that the 
> >> devices
> >> see the address space of the CPUs they’re attached to.
> >> 
> >> Not sure if it’s OK to share weblinks in this mailing list, please let me 
> >> know if that’s
> >> not preferred. But I’m referring to the terminology used in

[PATCH v2 2/2] linux-user: sigprocmask check read perms first

2022-01-26 Thread Patrick Venture

Linux kernel now checks the read permissions before validating `how`

Suggested-by: Laurent Vivier 
Signed-off-by: Patrick Venture 
---
 linux-user/syscall.c | 12 +++-
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/linux-user/syscall.c b/linux-user/syscall.c
index 34bd819e38..210483d4e4 100644
--- a/linux-user/syscall.c
+++ b/linux-user/syscall.c
@@ -9465,6 +9465,13 @@ static abi_long do_syscall1(void *cpu_env, int num, 
abi_long arg1,
 int how;
 
 if (arg2) {
+p = lock_user(VERIFY_READ, arg2, sizeof(target_sigset_t), 1));
+if (!p) {
+return -TARGET_EFAULT;
+}
+target_to_host_old_sigset(, p);
+unlock_user(p, arg2, 0);
+set_ptr = 
 switch (arg1) {
 case TARGET_SIG_BLOCK:
 how = SIG_BLOCK;
@@ -9478,11 +9485,6 @@ static abi_long do_syscall1(void *cpu_env, int num, 
abi_long arg1,
 default:
 return -TARGET_EINVAL;
 }
-if (!(p = lock_user(VERIFY_READ, arg2, 
sizeof(target_sigset_t), 1)))
-return -TARGET_EFAULT;
-target_to_host_old_sigset(, p);
-unlock_user(p, arg2, 0);
-set_ptr = 
 } else {
 how = 0;
 set_ptr = NULL;
-- 
2.35.0.rc0.227.g00780c9af4-goog

Re: [PATCH 0/8] target/ppc: powerpc_excp improvements [74xx] (5/n)

2022-01-26 Thread BALATON Zoltan


On Wed, 26 Jan 2022, Fabiano Rosas wrote:

This handles the exception code for the 74xx family, i.e. 7400, 7410,
7440, 7445, 7450, 7455, 7457.

This is another family that is quite well known, so it should be
straight-forward as well.


I guess this is what may break VOF on pegasos2. Was Philippe's test case 
for this machine ever merged? (Although that may use the firmware ROM that 
was preferred as it tests more of the machine and may predate VOF so not 
sure it also tests with VOF.) The way to test it is this:


Get morphos demo ISO from https://www.morphos-team.net/morphos-3.15.iso
Extract boot.img from the root directory of the CD
Run QEMU as shown at http://zero.eik.bme.hu/~balaton/qemu/amiga/#morphos

(For debugging maybe enabling vof traces would give more info but it was 
a while so I don't remember the details any more.)



Based on legoater/ppc-7.0


I could test when it's merged or when it applies on master but I don't 
usually test on branches. Did you verify it still works with pegasos2 or 
could you please make sure it won't break that use case?


Regards,
BALATON Zoltan


Fabiano Rosas (8):
 target/ppc: Introduce powerpc_excp_74xx
 target/ppc: Simplify powerpc_excp_74xx
 target/ppc: 74xx: Machine Check exception cleanup
 target/ppc: 74xx: External interrupt cleanup
 target/ppc: 74xx: Program exception cleanup
 target/ppc: 74xx: System Call exception cleanup
 target/ppc: 74xx: System Reset interrupt cleanup
 target/ppc: 74xx: Set SRRs directly in exception code

target/ppc/excp_helper.c | 175 +++
1 file changed, 175 insertions(+)

[PATCH v2 1/2] linux-user: rt_sigprocmask, check read perms first

2022-01-26 Thread Patrick Venture

From: Shu-Chun Weng 

Linux kernel does it this way (checks read permission before validating `how`)
and the latest version of ABSL's `AddressIsReadable()` depends on this
behavior.

c.f.  
https://github.com/torvalds/linux/blob/9539ba4308ad5bdca6cb41c7b73cbb9f796dcdd7/kernel/signal.c#L3147
Reviewed-by: Patrick Venture 
Signed-off-by: Shu-Chun Weng 
---
 linux-user/syscall.c | 12 +++-
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/linux-user/syscall.c b/linux-user/syscall.c
index 5950222a77..34bd819e38 100644
--- a/linux-user/syscall.c
+++ b/linux-user/syscall.c
@@ -9508,6 +9508,13 @@ static abi_long do_syscall1(void *cpu_env, int num, 
abi_long arg1,
 }
 
 if (arg2) {
+p = lock_user(VERIFY_READ, arg2, sizeof(target_sigset_t), 1);
+if (!p) {
+return -TARGET_EFAULT;
+}
+target_to_host_sigset(, p);
+unlock_user(p, arg2, 0);
+set_ptr = 
 switch(how) {
 case TARGET_SIG_BLOCK:
 how = SIG_BLOCK;
@@ -9521,11 +9528,6 @@ static abi_long do_syscall1(void *cpu_env, int num, 
abi_long arg1,
 default:
 return -TARGET_EINVAL;
 }
-if (!(p = lock_user(VERIFY_READ, arg2, 
sizeof(target_sigset_t), 1)))
-return -TARGET_EFAULT;
-target_to_host_sigset(, p);
-unlock_user(p, arg2, 0);
-set_ptr = 
 } else {
 how = 0;
 set_ptr = NULL;
-- 
2.35.0.rc0.227.g00780c9af4-goog

Re: [PATCH v4 03/42] MAINTAINERS: Add entry for Compute Express Link Emulation

2022-01-26 Thread Alex Bennée



Jonathan Cameron  writes:

> From: Jonathan Cameron 
>
> The CXL emulation will be jointly maintained by Ben Widawsky
> and Jonathan Cameron.  Broken out as a separate patch
> to improve visibility.
>
> Signed-off-by: Jonathan Cameron 

Reviewed-by: Alex Bennée 

-- 
Alex Bennée

Re: [PATCH 1/2] ppc/pnv: initialize 'taddr' in pnv_phb3_translate_tve()

2022-01-26 Thread Matheus K. Ferst


On 26/01/2022 10:41, Daniel Henrique Barboza wrote:

The 'taddr' variable is left unintialized, being set only inside the
"while ((lev--) >= 0)" loop where we get the TCE address. The 'lev' var
is an int32_t that is being initiliazed by the GETFIELD() macro, which
returns an uint64_t.

For a human reader this means that 'lev' will always be positive or zero.
But some compilers may beg to differ. 'lev' being an int32_t can in theory
be set as negative, and the "while ((lev--) >= 0)" loop might never be
reached, and 'taddr' will be left unitialized.


If we expect this code to execute at least once, wouldn't it be better 
to use a do-while? E.g.:


do {
lev--;

/* Grab the TCE address */
taddr = base | (((addr >> sh) & ((1ul << tbl_shift) - 1)) << 3);
if (dma_memory_read(_space_memory, taddr, ,
/* ... */
}
sh -= tbl_shift;
base = tce & ~0xfffull;
} while (lev >= 0);

Otherwise, I think we'll need to initialize tce too.

Thanks,
Matheus K. Ferst
Instituto de Pesquisas ELDORADO 
Analista de Software
Aviso Legal - Disclaimer

[PATCH v2 0/2] linux-user: check read permissions before how

2022-01-26 Thread Patrick Venture

In sigprocmask check the read permissions first before checking the `how`.

This is done for both: TARGET_NR_sigprocmask and TARGET_NR_rt_sigprocmask

v2:
 * Update code style during code change
 * Also update check order for TARGET_NR_sigprocmask

Patrick Venture (1):
  linux-user: sigprocmask check read perms first

Shu-Chun Weng (1):
  linux-user: rt_sigprocmask, check read perms first

 linux-user/syscall.c | 24 ++--
 1 file changed, 14 insertions(+), 10 deletions(-)

-- 
2.35.0.rc0.227.g00780c9af4-goog

Re: [PATCH v4 01/15] pcie: Add support for Single Root I/O Virtualization (SR/IOV)

2022-01-26 Thread Lukasz Maniak

On Wed, Jan 26, 2022 at 06:11:06PM +0100, Lukasz Maniak wrote:
> From: Knut Omang 
> 
> This patch provides the building blocks for creating an SR/IOV
> PCIe Extended Capability header and register/unregister
> SR/IOV Virtual Functions.
> 
> Signed-off-by: Knut Omang 

Hi Knut,

We have edited the comments to which Michael drew attention.
I also resolved the issues reported by the checkpatch script for this
patch.

Please kindly check and confirm that you agree with these changes.

Thanks,
Lukasz

> ---
>  hw/pci/meson.build  |   1 +
>  hw/pci/pci.c| 100 +---
>  hw/pci/pcie.c   |   5 +
>  hw/pci/pcie_sriov.c | 294 
>  hw/pci/trace-events |   5 +
>  include/hw/pci/pci.h|  12 +-
>  include/hw/pci/pcie.h   |   6 +
>  include/hw/pci/pcie_sriov.h |  71 +
>  include/qemu/typedefs.h |   2 +
>  9 files changed, 470 insertions(+), 26 deletions(-)
>  create mode 100644 hw/pci/pcie_sriov.c
>  create mode 100644 include/hw/pci/pcie_sriov.h
> 
> diff --git a/hw/pci/meson.build b/hw/pci/meson.build
> index 5c4bbac817..bcc9c75919 100644
> --- a/hw/pci/meson.build
> +++ b/hw/pci/meson.build
> @@ -5,6 +5,7 @@ pci_ss.add(files(
>'pci.c',
>'pci_bridge.c',
>'pci_host.c',
> +  'pcie_sriov.c',
>'shpc.c',
>'slotid_cap.c'
>  ))
> diff --git a/hw/pci/pci.c b/hw/pci/pci.c
> index 5d30f9ca60..ba8fb92efc 100644
> --- a/hw/pci/pci.c
> +++ b/hw/pci/pci.c
> @@ -239,6 +239,9 @@ int pci_bar(PCIDevice *d, int reg)
>  {
>  uint8_t type;
>  
> +/* PCIe virtual functions do not have their own BARs */
> +assert(!pci_is_vf(d));
> +
>  if (reg != PCI_ROM_SLOT)
>  return PCI_BASE_ADDRESS_0 + reg * 4;
>  
> @@ -304,10 +307,30 @@ void pci_device_deassert_intx(PCIDevice *dev)
>  }
>  }
>  
> -static void pci_do_device_reset(PCIDevice *dev)
> +static void pci_reset_regions(PCIDevice *dev)
>  {
>  int r;
> +if (pci_is_vf(dev)) {
> +return;
> +}
> +
> +for (r = 0; r < PCI_NUM_REGIONS; ++r) {
> +PCIIORegion *region = >io_regions[r];
> +if (!region->size) {
> +continue;
> +}
>  
> +if (!(region->type & PCI_BASE_ADDRESS_SPACE_IO) &&
> +region->type & PCI_BASE_ADDRESS_MEM_TYPE_64) {
> +pci_set_quad(dev->config + pci_bar(dev, r), region->type);
> +} else {
> +pci_set_long(dev->config + pci_bar(dev, r), region->type);
> +}
> +}
> +}
> +
> +static void pci_do_device_reset(PCIDevice *dev)
> +{
>  pci_device_deassert_intx(dev);
>  assert(dev->irq_state == 0);
>  
> @@ -323,19 +346,7 @@ static void pci_do_device_reset(PCIDevice *dev)
>pci_get_word(dev->wmask + PCI_INTERRUPT_LINE) |
>pci_get_word(dev->w1cmask + 
> PCI_INTERRUPT_LINE));
>  dev->config[PCI_CACHE_LINE_SIZE] = 0x0;
> -for (r = 0; r < PCI_NUM_REGIONS; ++r) {
> -PCIIORegion *region = >io_regions[r];
> -if (!region->size) {
> -continue;
> -}
> -
> -if (!(region->type & PCI_BASE_ADDRESS_SPACE_IO) &&
> -region->type & PCI_BASE_ADDRESS_MEM_TYPE_64) {
> -pci_set_quad(dev->config + pci_bar(dev, r), region->type);
> -} else {
> -pci_set_long(dev->config + pci_bar(dev, r), region->type);
> -}
> -}
> +pci_reset_regions(dev);
>  pci_update_mappings(dev);
>  
>  msi_reset(dev);
> @@ -884,6 +895,16 @@ static void pci_init_multifunction(PCIBus *bus, 
> PCIDevice *dev, Error **errp)
>  dev->config[PCI_HEADER_TYPE] |= PCI_HEADER_TYPE_MULTI_FUNCTION;
>  }
>  
> +/*
> + * With SR/IOV and ARI, a device at function 0 need not be a 
> multifunction
> + * device, as it may just be a VF that ended up with function 0 in
> + * the legacy PCI interpretation. Avoid failing in such cases:
> + */
> +if (pci_is_vf(dev) &&
> +dev->exp.sriov_vf.pf->cap_present & QEMU_PCI_CAP_MULTIFUNCTION) {
> +return;
> +}
> +
>  /*
>   * multifunction bit is interpreted in two ways as follows.
>   *   - all functions must set the bit to 1.
> @@ -1083,6 +1104,7 @@ static PCIDevice *do_pci_register_device(PCIDevice 
> *pci_dev,
> bus->devices[devfn]->name);
>  return NULL;
>  } else if (dev->hotplugged &&
> +   !pci_is_vf(pci_dev) &&
> pci_get_function_0(pci_dev)) {
>  error_setg(errp, "PCI: slot %d function 0 already occupied by %s,"
> " new func %s cannot be exposed to guest.",
> @@ -1191,6 +1213,7 @@ void pci_register_bar(PCIDevice *pci_dev, int 
> region_num,
>  pcibus_t size = memory_region_size(memory);
>  uint8_t hdr_type;
>  
> +assert(!pci_is_vf(pci_dev)); /* VFs must use pcie_sriov_vf_register_bar 
> */
>  assert(region_num >= 0);
>  assert(region_num < PCI_NUM_REGIONS);
>

[PATCH v4 15/15] hw/nvme: Update the initalization place for the AER queue

2022-01-26 Thread Lukasz Maniak

From: Łukasz Gieryk 

This patch updates the initialization place for the AER queue, so it’s
initialized once, at controller initialization, and not every time
controller is enabled.

While the original version works for a non-SR-IOV device, as it’s hard
to interact with the controller if it’s not enabled, the multiple
reinitialization is not necessarily correct.

With the SR/IOV feature enabled a segfault can happen: a VF can have its
controller disabled, while a namespace can still be attached to the
controller through the parent PF. An event generated in such case ends
up on an uninitialized queue.

While it’s an interesting question whether a VF should support AER in
the first place, I don’t think it must be answered today.

Signed-off-by: Łukasz Gieryk 
---
 hw/nvme/ctrl.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
index 624db2f9c6..b2228e960f 100644
--- a/hw/nvme/ctrl.c
+++ b/hw/nvme/ctrl.c
@@ -6029,8 +6029,6 @@ static int nvme_start_ctrl(NvmeCtrl *n)
 
 nvme_set_timestamp(n, 0ULL);
 
-QTAILQ_INIT(>aer_queue);
-
 nvme_select_iocs(n);
 
 return 0;
@@ -7007,6 +7005,8 @@ static void nvme_init_ctrl(NvmeCtrl *n, PCIDevice 
*pci_dev)
 id->cmic |= NVME_CMIC_MULTI_CTRL;
 }
 
+QTAILQ_INIT(>aer_queue);
+
 NVME_CAP_SET_MQES(cap, 0x7ff);
 NVME_CAP_SET_CQR(cap, 1);
 NVME_CAP_SET_TO(cap, 0xf);
-- 
2.25.1

Re: [PATCH] linux-user: rt_sigprocmask, check read perms first

2022-01-26 Thread Patrick Venture

On Tue, Jan 11, 2022 at 3:06 PM Patrick Venture  wrote:

>
>
> On Tue, Jan 11, 2022 at 12:50 PM Laurent Vivier  wrote:
>
>> Hi Patrick,
>>
>> Le 11/01/2022 à 21:14, Patrick Venture a écrit :
>> >
>> >
>> > On Sat, Jan 8, 2022 at 10:16 AM Laurent Vivier > > wrote:
>> >
>> > Le 06/01/2022 à 23:00, Patrick Venture a écrit :
>> >  > From: Shu-Chun Weng mailto:s...@google.com>>
>> >  >
>> >  > Linux kernel does it this way (checks read permission before
>> validating `how`)
>> >  > and the latest version of ABSL's `AddressIsReadable()` depends
>> on this
>> >  > behavior.
>> >  >
>> >  > c.f.
>> >
>> https://github.com/torvalds/linux/blob/9539ba4308ad5bdca6cb41c7b73cbb9f796dcdd7/kernel/signal.c#L3147
>> > <
>> https://github.com/torvalds/linux/blob/9539ba4308ad5bdca6cb41c7b73cbb9f796dcdd7/kernel/signal.c#L3147
>> >
>> >  > Reviewed-by: Patrick Venture > vent...@google.com>>
>> >  > Signed-off-by: Shu-Chun Weng > s...@google.com>>
>> >  > ---
>> >  >   linux-user/syscall.c | 10 +-
>> >  >   1 file changed, 5 insertions(+), 5 deletions(-)
>> >  >
>> >  > diff --git a/linux-user/syscall.c b/linux-user/syscall.c
>> >  > index ce9d64896c..3070d31f34 100644
>> >  > --- a/linux-user/syscall.c
>> >  > +++ b/linux-user/syscall.c
>> >  > @@ -9491,6 +9491,11 @@ static abi_long do_syscall1(void
>> *cpu_env, int num, abi_long arg1,
>> >  >   }
>> >  >
>> >  >   if (arg2) {
>> >  > +if (!(p = lock_user(VERIFY_READ, arg2,
>> sizeof(target_sigset_t), 1)))
>> >  > +return -TARGET_EFAULT;
>> >  > +target_to_host_sigset(, p);
>> >  > +unlock_user(p, arg2, 0);
>> >  > +set_ptr = 
>> >  >   switch(how) {
>> >  >   case TARGET_SIG_BLOCK:
>> >  >   how = SIG_BLOCK;
>> >  > @@ -9504,11 +9509,6 @@ static abi_long do_syscall1(void
>> *cpu_env, int num, abi_long arg1,
>> >  >   default:
>> >  >   return -TARGET_EINVAL;
>> >  >   }
>> >  > -if (!(p = lock_user(VERIFY_READ, arg2,
>> sizeof(target_sigset_t), 1)))
>> >  > -return -TARGET_EFAULT;
>> >  > -target_to_host_sigset(, p);
>> >  > -unlock_user(p, arg2, 0);
>> >  > -set_ptr = 
>> >  >   } else {
>> >  >   how = 0;
>> >  >   set_ptr = NULL;
>> >
>> > I know it's only code move but generally we also update the style
>> to pass scripts/checkpatch.pl
>> > 
>> > successfully.
>> >
>> >
>> > That is a reasonable request, however, can I just send a follow-on
>> patch?  I didn't write this one
>> > and I honestly don't know much about it, but I don't mind doing the
>> cleanup
>> >
>> >
>> > Could you also update TARGET_NR_sigprocmask in the same way as it
>> seems the kernel behaves like
>> > this
>> > too in this case?
>> >
>> >
>> > I can take a look.  I would prefer then to also prefetch the style
>> fixup in a preceding patch. I
>> > don't recall seeing whether qemu supports clang-format.
>> >
>>
>> There is no problem. You can keep this patch unmodified, and add patches
>> to fix the problems.
>>
>> I only ask to have all the patches in one series.
>>
>
> Will take a swing at this for v2.
>

Laurent,
I spent some time today going over the patches and digging into what
they're actually doing and I think I can make this two patches, both with
the style changes squashed and will send in a v2 today.

Thanks


>
>
>>
>> Thanks,
>> Laurent
>>
>>

[PATCH v4 12/15] hw/nvme: Initialize capability structures for primary/secondary controllers

2022-01-26 Thread Lukasz Maniak

From: Łukasz Gieryk 

With four new properties:
 - sriov_v{i,q}_flexible,
 - sriov_max_v{i,q}_per_vf,
one can configure the number of available flexible resources, as well as
the limits. The primary and secondary controller capability structures
are initialized accordingly.

Since the number of available queues (interrupts) now varies between
VF/PF, BAR size calculation is also adjusted.

Signed-off-by: Łukasz Gieryk 
---
 hw/nvme/ctrl.c   | 142 ---
 hw/nvme/nvme.h   |   4 ++
 include/block/nvme.h |   5 ++
 3 files changed, 144 insertions(+), 7 deletions(-)

diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
index e101cb7d7c..551c8795f2 100644
--- a/hw/nvme/ctrl.c
+++ b/hw/nvme/ctrl.c
@@ -36,6 +36,10 @@
  *  zoned.zasl=, \
  *  zoned.auto_transition=, \
  *  sriov_max_vfs= \
+ *  sriov_vq_flexible= \
+ *  sriov_vi_flexible= \
+ *  sriov_max_vi_per_vf= \
+ *  sriov_max_vq_per_vf= \
  *  subsys=
  *  -device nvme-ns,drive=,bus=,nsid=,\
  *  zoned=, \
@@ -113,6 +117,29 @@
  *   enables reporting of both SR-IOV and ARI capabilities by the NVMe device.
  *   Virtual function controllers will not report SR-IOV capability.
  *
+ *   NOTE: Single Root I/O Virtualization support is experimental.
+ *   All the related parameters may be subject to change.
+ *
+ * - `sriov_vq_flexible`
+ *   Indicates the total number of flexible queue resources assignable to all
+ *   the secondary controllers. Implicitly sets the number of primary
+ *   controller's private resources to `(max_ioqpairs - sriov_vq_flexible)`.
+ *
+ * - `sriov_vi_flexible`
+ *   Indicates the total number of flexible interrupt resources assignable to
+ *   all the secondary controllers. Implicitly sets the number of primary
+ *   controller's private resources to `(msix_qsize - sriov_vi_flexible)`.
+ *
+ * - `sriov_max_vi_per_vf`
+ *   Indicates the maximum number of virtual interrupt resources assignable
+ *   to a secondary controller. The default 0 resolves to
+ *   `(sriov_vi_flexible / sriov_max_vfs)`.
+ *
+ * - `sriov_max_vq_per_vf`
+ *   Indicates the maximum number of virtual queue resources assignable to
+ *   a secondary controller. The default 0 resolves to
+ *   `(sriov_vq_flexible / sriov_max_vfs)`.
+ *
  * nvme namespace device parameters
  * 
  * - `shared`
@@ -184,6 +211,7 @@
 #define NVME_NUM_FW_SLOTS 1
 #define NVME_DEFAULT_MAX_ZA_SIZE (128 * KiB)
 #define NVME_MAX_VFS 127
+#define NVME_VF_RES_GRANULARITY 1
 #define NVME_VF_OFFSET 0x1
 #define NVME_VF_STRIDE 1
 
@@ -6359,6 +6387,54 @@ static void nvme_check_constraints(NvmeCtrl *n, Error 
**errp)
 error_setg(errp, "PMR is not supported with SR-IOV");
 return;
 }
+
+if (!params->sriov_vq_flexible || !params->sriov_vi_flexible) {
+error_setg(errp, "both sriov_vq_flexible and sriov_vi_flexible"
+   " must be set for the use of SR-IOV");
+return;
+}
+
+if (params->sriov_vq_flexible < params->sriov_max_vfs * 2) {
+error_setg(errp, "sriov_vq_flexible must be greater than or equal"
+   " to %d (sriov_max_vfs * 2)", params->sriov_max_vfs * 
2);
+return;
+}
+
+if (params->max_ioqpairs < params->sriov_vq_flexible + 2) {
+error_setg(errp, "sriov_vq_flexible - max_ioqpairs (PF-private"
+   " queue resources) must be greater than or equal to 2");
+return;
+}
+
+if (params->sriov_vi_flexible < params->sriov_max_vfs) {
+error_setg(errp, "sriov_vi_flexible must be greater than or equal"
+   " to %d (sriov_max_vfs)", params->sriov_max_vfs);
+return;
+}
+
+if (params->msix_qsize < params->sriov_vi_flexible + 1) {
+error_setg(errp, "sriov_vi_flexible - msix_qsize (PF-private"
+   " interrupt resources) must be greater than or equal"
+   " to 1");
+return;
+}
+
+if (params->sriov_max_vi_per_vf &&
+(params->sriov_max_vi_per_vf - 1) % NVME_VF_RES_GRANULARITY) {
+error_setg(errp, "sriov_max_vi_per_vf must meet:"
+   " (X - 1) %% %d == 0 and X >= 1",
+   NVME_VF_RES_GRANULARITY);
+return;
+}
+
+if (params->sriov_max_vq_per_vf &&
+(params->sriov_max_vq_per_vf < 2 ||
+ (params->sriov_max_vq_per_vf - 1) % NVME_VF_RES_GRANULARITY)) {
+error_setg(errp, "sriov_max_vq_per_vf must meet:"
+   " (X - 1) %% %d == 0 and X >= 2",
+   NVME_VF_RES_GRANULARITY);
+return;
+}
 }
 }
 
@@ -6367,10 +6443,19 @@ static void nvme_init_state(NvmeCtrl *n)
 NvmePriCtrlCap *cap = >pri_ctrl_cap;

[PATCH v4 10/15] hw/nvme: Remove reg_size variable and update BAR0 size calculation

2022-01-26 Thread Lukasz Maniak

From: Łukasz Gieryk 

The n->reg_size parameter unnecessarily splits the BAR0 size calculation
in two phases; removed to simplify the code.

With all the calculations done in one place, it seems the pow2ceil,
applied originally to reg_size, is unnecessary. The rounding should
happen as the last step, when BAR size includes Nvme registers, queue
registers, and MSIX-related space.

Finally, the size of the mmio memory region is extended to cover the 1st
4KiB padding (see the map below). Access to this range is handled as
interaction with a non-existing queue and generates an error trace, so
actually nothing changes, while the reg_size variable is no longer needed.


|  BAR0|

[Nvme Registers]
[Queues]
[power-of-2 padding] - removed in this patch
[4KiB padding (1)  ]
[MSIX TABLE]
[4KiB padding (2)  ]
[MSIX PBA  ]
[power-of-2 padding]

Signed-off-by: Łukasz Gieryk 
---
 hw/nvme/ctrl.c | 10 +-
 hw/nvme/nvme.h |  1 -
 2 files changed, 5 insertions(+), 6 deletions(-)

diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
index 426507ca8a..40eb6bd1a8 100644
--- a/hw/nvme/ctrl.c
+++ b/hw/nvme/ctrl.c
@@ -6372,9 +6372,6 @@ static void nvme_init_state(NvmeCtrl *n)
 n->conf_ioqpairs = n->params.max_ioqpairs;
 n->conf_msix_qsize = n->params.msix_qsize;
 
-/* add one to max_ioqpairs to account for the admin queue pair */
-n->reg_size = pow2ceil(sizeof(NvmeBar) +
-   2 * (n->params.max_ioqpairs + 1) * NVME_DB_SIZE);
 n->sq = g_new0(NvmeSQueue *, n->params.max_ioqpairs + 1);
 n->cq = g_new0(NvmeCQueue *, n->params.max_ioqpairs + 1);
 n->temperature = NVME_TEMPERATURE;
@@ -6498,7 +6495,10 @@ static int nvme_init_pci(NvmeCtrl *n, PCIDevice 
*pci_dev, Error **errp)
 pcie_ari_init(pci_dev, 0x100, 1);
 }
 
-bar_size = QEMU_ALIGN_UP(n->reg_size, 4 * KiB);
+/* add one to max_ioqpairs to account for the admin queue pair */
+bar_size = sizeof(NvmeBar) +
+   2 * (n->params.max_ioqpairs + 1) * NVME_DB_SIZE;
+bar_size = QEMU_ALIGN_UP(bar_size, 4 * KiB);
 msix_table_offset = bar_size;
 msix_table_size = PCI_MSIX_ENTRY_SIZE * n->params.msix_qsize;
 
@@ -6512,7 +6512,7 @@ static int nvme_init_pci(NvmeCtrl *n, PCIDevice *pci_dev, 
Error **errp)
 
 memory_region_init(>bar0, OBJECT(n), "nvme-bar0", bar_size);
 memory_region_init_io(>iomem, OBJECT(n), _mmio_ops, n, "nvme",
-  n->reg_size);
+  msix_table_offset);
 memory_region_add_subregion(>bar0, 0, >iomem);
 
 if (pci_is_vf(pci_dev)) {
diff --git a/hw/nvme/nvme.h b/hw/nvme/nvme.h
index 927890b490..1401ac3904 100644
--- a/hw/nvme/nvme.h
+++ b/hw/nvme/nvme.h
@@ -414,7 +414,6 @@ typedef struct NvmeCtrl {
 uint16_tmax_prp_ents;
 uint16_tcqe_size;
 uint16_tsqe_size;
-uint32_treg_size;
 uint32_tmax_q_ents;
 uint8_t outstanding_aers;
 uint32_tirq_status;
-- 
2.25.1

[PATCH v4 14/15] docs: Add documentation for SR-IOV and Virtualization Enhancements

2022-01-26 Thread Lukasz Maniak

Signed-off-by: Lukasz Maniak 
---
 docs/system/devices/nvme.rst | 36 
 1 file changed, 36 insertions(+)

diff --git a/docs/system/devices/nvme.rst b/docs/system/devices/nvme.rst
index b5acb2a9c1..166a11abc6 100644
--- a/docs/system/devices/nvme.rst
+++ b/docs/system/devices/nvme.rst
@@ -239,3 +239,39 @@ The virtual namespace device supports DIF- and DIX-based 
protection information
   to ``1`` to transfer protection information as the first eight bytes of
   metadata. Otherwise, the protection information is transferred as the last
   eight bytes.
+
+Virtualization Enhancements and SR-IOV (Experimental Support)
+-
+
+The ``nvme`` device supports Single Root I/O Virtualization and Sharing
+along with Virtualization Enhancements. The controller has to be linked to
+an NVM Subsystem device (``nvme-subsys``) for use with SR-IOV.
+
+A number of parameters are present (**please note, that they may be
+subject to change**):
+
+``sriov_max_vfs`` (default: ``0``)
+  Indicates the maximum number of PCIe virtual functions supported
+  by the controller. Specifying a non-zero value enables reporting of both
+  SR-IOV and ARI (Alternative Routing-ID Interpretation) capabilities
+  by the NVMe device. Virtual function controllers will not report SR-IOV.
+
+``sriov_vq_flexible``
+  Indicates the total number of flexible queue resources assignable to all
+  the secondary controllers. Implicitly sets the number of primary
+  controller's private resources to ``(max_ioqpairs - sriov_vq_flexible)``.
+
+``sriov_vi_flexible``
+  Indicates the total number of flexible interrupt resources assignable to
+  all the secondary controllers. Implicitly sets the number of primary
+  controller's private resources to ``(msix_qsize - sriov_vi_flexible)``.
+
+``sriov_max_vi_per_vf`` (default: ``0``)
+  Indicates the maximum number of virtual interrupt resources assignable
+  to a secondary controller. The default ``0`` resolves to
+  ``(sriov_vi_flexible / sriov_max_vfs)``
+
+``sriov_max_vq_per_vf`` (default: ``0``)
+  Indicates the maximum number of virtual queue resources assignable to
+  a secondary controller. The default ``0`` resolves to
+  ``(sriov_vq_flexible / sriov_max_vfs)``
-- 
2.25.1

[PATCH v4 11/15] hw/nvme: Calculate BAR attributes in a function

2022-01-26 Thread Lukasz Maniak

From: Łukasz Gieryk 

An NVMe device with SR-IOV capability calculates the BAR size
differently for PF and VF, so it makes sense to extract the common code
to a separate function.

Signed-off-by: Łukasz Gieryk 
---
 hw/nvme/ctrl.c | 45 +++--
 1 file changed, 31 insertions(+), 14 deletions(-)

diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
index 40eb6bd1a8..e101cb7d7c 100644
--- a/hw/nvme/ctrl.c
+++ b/hw/nvme/ctrl.c
@@ -6431,6 +6431,34 @@ static void nvme_init_pmr(NvmeCtrl *n, PCIDevice 
*pci_dev)
 memory_region_set_enabled(>pmr.dev->mr, false);
 }
 
+static uint64_t nvme_bar_size(unsigned total_queues, unsigned total_irqs,
+  unsigned *msix_table_offset,
+  unsigned *msix_pba_offset)
+{
+uint64_t bar_size, msix_table_size, msix_pba_size;
+
+bar_size = sizeof(NvmeBar) + 2 * total_queues * NVME_DB_SIZE;
+bar_size = QEMU_ALIGN_UP(bar_size, 4 * KiB);
+
+if (msix_table_offset) {
+*msix_table_offset = bar_size;
+}
+
+msix_table_size = PCI_MSIX_ENTRY_SIZE * total_irqs;
+bar_size += msix_table_size;
+bar_size = QEMU_ALIGN_UP(bar_size, 4 * KiB);
+
+if (msix_pba_offset) {
+*msix_pba_offset = bar_size;
+}
+
+msix_pba_size = QEMU_ALIGN_UP(total_irqs, 64) / 8;
+bar_size += msix_pba_size;
+
+bar_size = pow2ceil(bar_size);
+return bar_size;
+}
+
 static void nvme_init_sriov(NvmeCtrl *n, PCIDevice *pci_dev, uint16_t offset,
 uint64_t bar_size)
 {
@@ -6470,7 +6498,7 @@ static int nvme_add_pm_capability(PCIDevice *pci_dev, 
uint8_t offset)
 static int nvme_init_pci(NvmeCtrl *n, PCIDevice *pci_dev, Error **errp)
 {
 uint8_t *pci_conf = pci_dev->config;
-uint64_t bar_size, msix_table_size, msix_pba_size;
+uint64_t bar_size;
 unsigned msix_table_offset, msix_pba_offset;
 int ret;
 
@@ -6496,19 +6524,8 @@ static int nvme_init_pci(NvmeCtrl *n, PCIDevice 
*pci_dev, Error **errp)
 }
 
 /* add one to max_ioqpairs to account for the admin queue pair */
-bar_size = sizeof(NvmeBar) +
-   2 * (n->params.max_ioqpairs + 1) * NVME_DB_SIZE;
-bar_size = QEMU_ALIGN_UP(bar_size, 4 * KiB);
-msix_table_offset = bar_size;
-msix_table_size = PCI_MSIX_ENTRY_SIZE * n->params.msix_qsize;
-
-bar_size += msix_table_size;
-bar_size = QEMU_ALIGN_UP(bar_size, 4 * KiB);
-msix_pba_offset = bar_size;
-msix_pba_size = QEMU_ALIGN_UP(n->params.msix_qsize, 64) / 8;
-
-bar_size += msix_pba_size;
-bar_size = pow2ceil(bar_size);
+bar_size = nvme_bar_size(n->params.max_ioqpairs + 1, n->params.msix_qsize,
+ _table_offset, _pba_offset);
 
 memory_region_init(>bar0, OBJECT(n), "nvme-bar0", bar_size);
 memory_region_init_io(>iomem, OBJECT(n), _mmio_ops, n, "nvme",
-- 
2.25.1

[PATCH v4 09/15] hw/nvme: Make max_ioqpairs and msix_qsize configurable in runtime

2022-01-26 Thread Lukasz Maniak

From: Łukasz Gieryk 

The NVMe device defines two properties: max_ioqpairs, msix_qsize. Having
them as constants is problematic for SR-IOV support.

SR-IOV introduces virtual resources (queues, interrupts) that can be
assigned to PF and its dependent VFs. Each device, following a reset,
should work with the configured number of queues. A single constant is
no longer sufficient to hold the whole state.

This patch tries to solve the problem by introducing additional
variables in NvmeCtrl’s state. The variables for, e.g., managing queues
are therefore organized as:
 - n->params.max_ioqpairs – no changes, constant set by the user
 - n->(mutable_state) – (not a part of this patch) user-configurable,
specifies number of queues available _after_
reset
 - n->conf_ioqpairs - (new) used in all the places instead of the ‘old’
  n->params.max_ioqpairs; initialized in realize()
  and updated during reset() to reflect user’s
  changes to the mutable state

Since the number of available i/o queues and interrupts can change in
runtime, buffers for sq/cqs and the MSIX-related structures are
allocated big enough to handle the limits, to completely avoid the
complicated reallocation. A helper function (nvme_update_msixcap_ts)
updates the corresponding capability register, to signal configuration
changes.

Signed-off-by: Łukasz Gieryk 
---
 hw/nvme/ctrl.c | 52 ++
 hw/nvme/nvme.h |  2 ++
 2 files changed, 38 insertions(+), 16 deletions(-)

diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
index b816b377c3..426507ca8a 100644
--- a/hw/nvme/ctrl.c
+++ b/hw/nvme/ctrl.c
@@ -416,12 +416,12 @@ static bool nvme_nsid_valid(NvmeCtrl *n, uint32_t nsid)
 
 static int nvme_check_sqid(NvmeCtrl *n, uint16_t sqid)
 {
-return sqid < n->params.max_ioqpairs + 1 && n->sq[sqid] != NULL ? 0 : -1;
+return sqid < n->conf_ioqpairs + 1 && n->sq[sqid] != NULL ? 0 : -1;
 }
 
 static int nvme_check_cqid(NvmeCtrl *n, uint16_t cqid)
 {
-return cqid < n->params.max_ioqpairs + 1 && n->cq[cqid] != NULL ? 0 : -1;
+return cqid < n->conf_ioqpairs + 1 && n->cq[cqid] != NULL ? 0 : -1;
 }
 
 static void nvme_inc_cq_tail(NvmeCQueue *cq)
@@ -4035,8 +4035,7 @@ static uint16_t nvme_create_sq(NvmeCtrl *n, NvmeRequest 
*req)
 trace_pci_nvme_err_invalid_create_sq_cqid(cqid);
 return NVME_INVALID_CQID | NVME_DNR;
 }
-if (unlikely(!sqid || sqid > n->params.max_ioqpairs ||
-n->sq[sqid] != NULL)) {
+if (unlikely(!sqid || sqid > n->conf_ioqpairs || n->sq[sqid] != NULL)) {
 trace_pci_nvme_err_invalid_create_sq_sqid(sqid);
 return NVME_INVALID_QID | NVME_DNR;
 }
@@ -4388,8 +4387,7 @@ static uint16_t nvme_create_cq(NvmeCtrl *n, NvmeRequest 
*req)
 trace_pci_nvme_create_cq(prp1, cqid, vector, qsize, qflags,
  NVME_CQ_FLAGS_IEN(qflags) != 0);
 
-if (unlikely(!cqid || cqid > n->params.max_ioqpairs ||
-n->cq[cqid] != NULL)) {
+if (unlikely(!cqid || cqid > n->conf_ioqpairs || n->cq[cqid] != NULL)) {
 trace_pci_nvme_err_invalid_create_cq_cqid(cqid);
 return NVME_INVALID_QID | NVME_DNR;
 }
@@ -4405,7 +4403,7 @@ static uint16_t nvme_create_cq(NvmeCtrl *n, NvmeRequest 
*req)
 trace_pci_nvme_err_invalid_create_cq_vector(vector);
 return NVME_INVALID_IRQ_VECTOR | NVME_DNR;
 }
-if (unlikely(vector >= n->params.msix_qsize)) {
+if (unlikely(vector >= n->conf_msix_qsize)) {
 trace_pci_nvme_err_invalid_create_cq_vector(vector);
 return NVME_INVALID_IRQ_VECTOR | NVME_DNR;
 }
@@ -5002,13 +5000,12 @@ defaults:
 
 break;
 case NVME_NUMBER_OF_QUEUES:
-result = (n->params.max_ioqpairs - 1) |
-((n->params.max_ioqpairs - 1) << 16);
+result = (n->conf_ioqpairs - 1) | ((n->conf_ioqpairs - 1) << 16);
 trace_pci_nvme_getfeat_numq(result);
 break;
 case NVME_INTERRUPT_VECTOR_CONF:
 iv = dw11 & 0x;
-if (iv >= n->params.max_ioqpairs + 1) {
+if (iv >= n->conf_ioqpairs + 1) {
 return NVME_INVALID_FIELD | NVME_DNR;
 }
 
@@ -5163,10 +5160,10 @@ static uint16_t nvme_set_feature(NvmeCtrl *n, 
NvmeRequest *req)
 
 trace_pci_nvme_setfeat_numq((dw11 & 0x) + 1,
 ((dw11 >> 16) & 0x) + 1,
-n->params.max_ioqpairs,
-n->params.max_ioqpairs);
-req->cqe.result = cpu_to_le32((n->params.max_ioqpairs - 1) |
-  ((n->params.max_ioqpairs - 1) << 16));
+n->conf_ioqpairs,
+n->conf_ioqpairs);
+req->cqe.result = cpu_to_le32((n->conf_ioqpairs - 1) |
+  ((n->conf_ioqpairs - 1) << 16));
 break;
 case

[PATCH v4 13/15] hw/nvme: Add support for the Virtualization Management command

2022-01-26 Thread Lukasz Maniak

From: Łukasz Gieryk 

With the new command one can:
 - assign flexible resources (queues, interrupts) to primary and
   secondary controllers,
 - toggle the online/offline state of given controller.

Signed-off-by: Łukasz Gieryk 
---
 hw/nvme/ctrl.c   | 257 ++-
 hw/nvme/nvme.h   |  20 
 hw/nvme/trace-events |   3 +
 include/block/nvme.h |  17 +++
 4 files changed, 295 insertions(+), 2 deletions(-)

diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
index 551c8795f2..624db2f9c6 100644
--- a/hw/nvme/ctrl.c
+++ b/hw/nvme/ctrl.c
@@ -188,6 +188,7 @@
 #include "qemu/error-report.h"
 #include "qemu/log.h"
 #include "qemu/units.h"
+#include "qemu/range.h"
 #include "qapi/error.h"
 #include "qapi/visitor.h"
 #include "sysemu/sysemu.h"
@@ -259,6 +260,7 @@ static const uint32_t nvme_cse_acs[256] = {
 [NVME_ADM_CMD_GET_FEATURES] = NVME_CMD_EFF_CSUPP,
 [NVME_ADM_CMD_ASYNC_EV_REQ] = NVME_CMD_EFF_CSUPP,
 [NVME_ADM_CMD_NS_ATTACHMENT]= NVME_CMD_EFF_CSUPP | NVME_CMD_EFF_NIC,
+[NVME_ADM_CMD_VIRT_MNGMT]   = NVME_CMD_EFF_CSUPP,
 [NVME_ADM_CMD_FORMAT_NVM]   = NVME_CMD_EFF_CSUPP | NVME_CMD_EFF_LBCC,
 };
 
@@ -290,6 +292,7 @@ static const uint32_t nvme_cse_iocs_zoned[256] = {
 };
 
 static void nvme_process_sq(void *opaque);
+static void nvme_ctrl_reset(NvmeCtrl *n, NvmeResetType rst);
 
 static uint16_t nvme_sqid(NvmeRequest *req)
 {
@@ -5541,6 +5544,167 @@ out:
 return status;
 }
 
+static void nvme_get_virt_res_num(NvmeCtrl *n, uint8_t rt, int *num_total,
+  int *num_prim, int *num_sec)
+{
+*num_total = le32_to_cpu(rt ?
+ n->pri_ctrl_cap.vifrt : n->pri_ctrl_cap.vqfrt);
+*num_prim = le16_to_cpu(rt ?
+n->pri_ctrl_cap.virfap : n->pri_ctrl_cap.vqrfap);
+*num_sec = le16_to_cpu(rt ? n->pri_ctrl_cap.virfa : n->pri_ctrl_cap.vqrfa);
+}
+
+static uint16_t nvme_assign_virt_res_to_prim(NvmeCtrl *n, NvmeRequest *req,
+ uint16_t cntlid, uint8_t rt,
+ int nr)
+{
+int num_total, num_prim, num_sec;
+
+if (cntlid != n->cntlid) {
+return NVME_INVALID_CTRL_ID | NVME_DNR;
+}
+
+nvme_get_virt_res_num(n, rt, _total, _prim, _sec);
+
+if (nr > num_total) {
+return NVME_INVALID_NUM_RESOURCES | NVME_DNR;
+}
+
+if (nr > num_total - num_sec) {
+return NVME_INVALID_RESOURCE_ID | NVME_DNR;
+}
+
+if (rt) {
+n->next_pri_ctrl_cap.virfap = cpu_to_le16(nr);
+} else {
+n->next_pri_ctrl_cap.vqrfap = cpu_to_le16(nr);
+}
+
+req->cqe.result = cpu_to_le32(nr);
+return req->status;
+}
+
+static void nvme_update_virt_res(NvmeCtrl *n, NvmeSecCtrlEntry *sctrl,
+ uint8_t rt, int nr)
+{
+int prev_nr, prev_total;
+
+if (rt) {
+prev_nr = le16_to_cpu(sctrl->nvi);
+prev_total = le32_to_cpu(n->pri_ctrl_cap.virfa);
+sctrl->nvi = cpu_to_le16(nr);
+n->pri_ctrl_cap.virfa = cpu_to_le32(prev_total + nr - prev_nr);
+} else {
+prev_nr = le16_to_cpu(sctrl->nvq);
+prev_total = le32_to_cpu(n->pri_ctrl_cap.vqrfa);
+sctrl->nvq = cpu_to_le16(nr);
+n->pri_ctrl_cap.vqrfa = cpu_to_le32(prev_total + nr - prev_nr);
+}
+}
+
+static uint16_t nvme_assign_virt_res_to_sec(NvmeCtrl *n, NvmeRequest *req,
+uint16_t cntlid, uint8_t rt, int 
nr)
+{
+int num_total, num_prim, num_sec, num_free, diff, limit;
+NvmeSecCtrlEntry *sctrl;
+
+sctrl = nvme_sctrl_for_cntlid(n, cntlid);
+if (!sctrl) {
+return NVME_INVALID_CTRL_ID | NVME_DNR;
+}
+
+if (sctrl->scs) {
+return NVME_INVALID_SEC_CTRL_STATE | NVME_DNR;
+}
+
+limit = le16_to_cpu(rt ? n->pri_ctrl_cap.vifrsm : n->pri_ctrl_cap.vqfrsm);
+if (nr > limit) {
+return NVME_INVALID_NUM_RESOURCES | NVME_DNR;
+}
+
+nvme_get_virt_res_num(n, rt, _total, _prim, _sec);
+num_free = num_total - num_prim - num_sec;
+diff = nr - le16_to_cpu(rt ? sctrl->nvi : sctrl->nvq);
+
+if (diff > num_free) {
+return NVME_INVALID_RESOURCE_ID | NVME_DNR;
+}
+
+nvme_update_virt_res(n, sctrl, rt, nr);
+req->cqe.result = cpu_to_le32(nr);
+
+return req->status;
+}
+
+static uint16_t nvme_virt_set_state(NvmeCtrl *n, uint16_t cntlid, bool online)
+{
+NvmeCtrl *sn = NULL;
+NvmeSecCtrlEntry *sctrl;
+int vf_index;
+
+sctrl = nvme_sctrl_for_cntlid(n, cntlid);
+if (!sctrl) {
+return NVME_INVALID_CTRL_ID | NVME_DNR;
+}
+
+if (!pci_is_vf(>parent_obj)) {
+vf_index = le16_to_cpu(sctrl->vfn) - 1;
+sn = NVME(pcie_sriov_get_vf_at_index(>parent_obj, vf_index));
+}
+
+if (online) {
+if (!sctrl->nvi || (le16_to_cpu(sctrl->nvq) < 2) || !sn) {
+return NVME_INVALID_SEC_CTRL_STATE | NVME_DNR;
+}
+
+if

[PATCH v4 07/15] hw/nvme: Add support for Secondary Controller List

2022-01-26 Thread Lukasz Maniak

Introduce handling for Secondary Controller List (Identify command with
CNS value of 15h).

Secondary controller ids are unique in the subsystem, hence they are
reserved by it upon initialization of the primary controller to the
number of sriov_max_vfs.

ID reservation requires the addition of an intermediate controller slot
state, so the reserved controller has the address 0x.
A secondary controller is in the reserved state when it has no virtual
function assigned, but its primary controller is realized.
Secondary controller reservations are released to NULL when its primary
controller is unregistered.

Signed-off-by: Lukasz Maniak 
---
 hw/nvme/ctrl.c   | 35 +
 hw/nvme/ns.c |  2 +-
 hw/nvme/nvme.h   | 18 +++
 hw/nvme/subsys.c | 75 ++--
 hw/nvme/trace-events |  1 +
 include/block/nvme.h | 20 
 6 files changed, 141 insertions(+), 10 deletions(-)

diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
index 1eb1c3df03..9ee5f83aa1 100644
--- a/hw/nvme/ctrl.c
+++ b/hw/nvme/ctrl.c
@@ -4552,6 +4552,29 @@ static uint16_t nvme_identify_pri_ctrl_cap(NvmeCtrl *n, 
NvmeRequest *req)
 sizeof(NvmePriCtrlCap), req);
 }
 
+static uint16_t nvme_identify_sec_ctrl_list(NvmeCtrl *n, NvmeRequest *req)
+{
+NvmeIdentify *c = (NvmeIdentify *)>cmd;
+uint16_t pri_ctrl_id = le16_to_cpu(n->pri_ctrl_cap.cntlid);
+uint16_t min_id = le16_to_cpu(c->ctrlid);
+uint8_t num_sec_ctrl = n->sec_ctrl_list.numcntl;
+NvmeSecCtrlList list = {0};
+uint8_t i;
+
+for (i = 0; i < num_sec_ctrl; i++) {
+if (n->sec_ctrl_list.sec[i].scid >= min_id) {
+list.numcntl = num_sec_ctrl - i;
+memcpy(, n->sec_ctrl_list.sec + i,
+   list.numcntl * sizeof(NvmeSecCtrlEntry));
+break;
+}
+}
+
+trace_pci_nvme_identify_sec_ctrl_list(pri_ctrl_id, list.numcntl);
+
+return nvme_c2h(n, (uint8_t *), sizeof(list), req);
+}
+
 static uint16_t nvme_identify_ns_csi(NvmeCtrl *n, NvmeRequest *req,
  bool active)
 {
@@ -4772,6 +4795,8 @@ static uint16_t nvme_identify(NvmeCtrl *n, NvmeRequest 
*req)
 return nvme_identify_ctrl_list(n, req, false);
 case NVME_ID_CNS_PRIMARY_CTRL_CAP:
 return nvme_identify_pri_ctrl_cap(n, req);
+case NVME_ID_CNS_SECONDARY_CTRL_LIST:
+return nvme_identify_sec_ctrl_list(n, req);
 case NVME_ID_CNS_CS_NS:
 return nvme_identify_ns_csi(n, req, true);
 case NVME_ID_CNS_CS_NS_PRESENT:
@@ -6323,6 +6348,9 @@ static void nvme_check_constraints(NvmeCtrl *n, Error 
**errp)
 static void nvme_init_state(NvmeCtrl *n)
 {
 NvmePriCtrlCap *cap = >pri_ctrl_cap;
+NvmeSecCtrlList *list = >sec_ctrl_list;
+NvmeSecCtrlEntry *sctrl;
+int i;
 
 /* add one to max_ioqpairs to account for the admin queue pair */
 n->reg_size = pow2ceil(sizeof(NvmeBar) +
@@ -6334,6 +6362,13 @@ static void nvme_init_state(NvmeCtrl *n)
 n->starttime_ms = qemu_clock_get_ms(QEMU_CLOCK_VIRTUAL);
 n->aer_reqs = g_new0(NvmeRequest *, n->params.aerl + 1);
 
+list->numcntl = cpu_to_le16(n->params.sriov_max_vfs);
+for (i = 0; i < n->params.sriov_max_vfs; i++) {
+sctrl = >sec[i];
+sctrl->pcid = cpu_to_le16(n->cntlid);
+sctrl->vfn = cpu_to_le16(i + 1);
+}
+
 cap->cntlid = cpu_to_le16(n->cntlid);
 }
 
diff --git a/hw/nvme/ns.c b/hw/nvme/ns.c
index 8b5f98c761..e7a54ac572 100644
--- a/hw/nvme/ns.c
+++ b/hw/nvme/ns.c
@@ -511,7 +511,7 @@ static void nvme_ns_realize(DeviceState *dev, Error **errp)
 for (i = 0; i < ARRAY_SIZE(subsys->ctrls); i++) {
 NvmeCtrl *ctrl = subsys->ctrls[i];
 
-if (ctrl) {
+if (ctrl && ctrl != SUBSYS_SLOT_RSVD) {
 nvme_attach_ns(ctrl, ns);
 }
 }
diff --git a/hw/nvme/nvme.h b/hw/nvme/nvme.h
index 81deb45dfb..2157a7b95f 100644
--- a/hw/nvme/nvme.h
+++ b/hw/nvme/nvme.h
@@ -43,6 +43,7 @@ typedef struct NvmeBus {
 #define TYPE_NVME_SUBSYS "nvme-subsys"
 #define NVME_SUBSYS(obj) \
 OBJECT_CHECK(NvmeSubsystem, (obj), TYPE_NVME_SUBSYS)
+#define SUBSYS_SLOT_RSVD (void *)0x
 
 typedef struct NvmeSubsystem {
 DeviceState parent_obj;
@@ -67,6 +68,10 @@ static inline NvmeCtrl *nvme_subsys_ctrl(NvmeSubsystem 
*subsys,
 return NULL;
 }
 
+if (subsys->ctrls[cntlid] == SUBSYS_SLOT_RSVD) {
+return NULL;
+}
+
 return subsys->ctrls[cntlid];
 }
 
@@ -463,6 +468,7 @@ typedef struct NvmeCtrl {
 } features;
 
 NvmePriCtrlCap  pri_ctrl_cap;
+NvmeSecCtrlList sec_ctrl_list;
 } NvmeCtrl;
 
 static inline NvmeNamespace *nvme_ns(NvmeCtrl *n, uint32_t nsid)
@@ -497,6 +503,18 @@ static inline uint16_t nvme_cid(NvmeRequest *req)
 return le16_to_cpu(req->cqe.cid);
 }
 
+static inline NvmeSecCtrlEntry *nvme_sctrl(NvmeCtrl *n)
+{
+PCIDevice *pci_dev = >parent_obj;
+NvmeCtrl *pf =

1 2 3 >

1 - 100 of 237 matches

Mail list logo