Re: [RFC PATCH v2 0/8] Removal of AioContext lock, bs->parents and ->children: new rwlock

2022-04-26 Thread Emanuele Giuseppe Esposito



Am 26/04/2022 um 10:51 schrieb Emanuele Giuseppe Esposito:
> Luckily, most of the cases where we recursively go through a graph are
> the BlockDriverState callback functions in block_int-common.h
> In order to understand what to protect, I categorized the callbacks in
> block_int-common.h depending on the type of function that calls them:
> 
> 1) If the caller is a generated_co_wrapper, this function must be
>    protected by rdlock. The reason is that generated_co_wrapper creates
>coroutines that run in the given bs AioContext, so it doesn't matter
>if we are running in the main loop or not, the coroutine might run
>in an iothread.
> 2) If the caller calls it directly, and has the GLOBAL_STATE_CODE() macro,
>then the function is safe. The main loop is the writer and thus won't
>read and write at the same time.
> 3) If the caller calls it directly, but does not have the GLOBAL_STATE_CODE()
>macro, then we need to check the callers and see case-by-case if the
>caller is in the main loop, if it needs to take the lock, or delegate
>this duty to its caller (to reduce the places where to take it).
> 
> I used the vrc script (https://github.com/bonzini/vrc) to get help finding
> all the callers of a callback. Using its filter function, I can
> omit all functions protected by the added lock to avoid having duplicates
> when querying for new callbacks.

I was wondering: if a function is in category (3) and runs in an
iothread, but the function itself is not (currently) recursive, meaning
it doesn't really traverse the graph or call someone that traverses it,
should I add the rdlock anyway or not?

Example: bdrv_co_drain_end
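
For concreteness, taking the lock in such a callback would look roughly
like this (a sketch only; the lock primitives are placeholder names for
the proposed rwlock, not a final API):

/* Sketch: category (3) callback running in an iothread.  Taking the
 * reader lock is insurance against future graph traversal, at the cost
 * of an extra critical section.  Lock names are placeholders. */
static void coroutine_fn drain_end_locked(BlockDriverState *bs)
{
    graph_co_rdlock();                  /* placeholder name */
    bs->drv->bdrv_co_drain_end(bs);     /* today: no graph traversal */
    graph_co_rdunlock();
}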

Pros:
   + Covers the case where a new recursive callback for a new/existing
 BlockDriver is implemented in the future.
   + Also covers the case where I or someone else missed a recursive part.

Cons:
   - Potentially introducing an unnecessary critical section.

What do you think?

Emanuele




Re: [PATCH v2 13/42] i386: Destructive vector helpers for AVX

2022-04-26 Thread Paolo Bonzini

On 4/25/22 00:01, Paul Brook wrote:

+#define SHUFFLE4(F, a, b, offset) do {  \
+r0 = a->F((order & 3) + offset);\
+r1 = a->F(((order >> 2) & 3) + offset); \
+r2 = b->F(((order >> 4) & 3) + offset); \
+r3 = b->F(((order >> 6) & 3) + offset); \
+d->F(offset) = r0;  \
+d->F(offset + 1) = r1;  \
+d->F(offset + 2) = r2;  \
+d->F(offset + 3) = r3;  \
+} while (0)
+
  #if SHIFT == 0
  void glue(helper_pshufw, SUFFIX)(Reg *d, Reg *s, int order)
  {
-Reg r;
+uint16_t r0, r1, r2, r3;
  
-r.W(0) = s->W(order & 3);
-r.W(1) = s->W((order >> 2) & 3);
-r.W(2) = s->W((order >> 4) & 3);
-r.W(3) = s->W((order >> 6) & 3);
-MOVE(*d, r);
+SHUFFLE4(W, s, s, 0);

I am not particularly attached to the MOVE macro, but replacing the Reg 
variable with scalars seems worse.


Paolo



Re: [PATCH v2 10/42] i386: Rewrite vector shift helper

2022-04-26 Thread Paolo Bonzini

On 4/25/22 23:33, Richard Henderson wrote:

I do not think it worthwhile to unroll these loops by hand.


Totally agree, as it would also remove most of the uses of 
XMM_ONLY/YMM_ONLY.


I also saw GCC -Warray-bounds complain about

if (SHIFT >= 1) {
d->elem[8] = s->elem[8];
}

though this should probably be treated as a GCC bug.

Paolo


If we're that keen on it, it should be written

#pragma GCC unroll 4 << SHIFT
    for (i = 0; i < 4 << SHIFT; ++i) {
        something
    }
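
For reference, the pragma in a self-contained form (a generic sketch,
not QEMU code):

/* Sketch: let GCC fully unroll a fixed-trip-count loop instead of
 * unrolling it by hand. */
void copy8(int *restrict d, const int *restrict s)
{
#pragma GCC unroll 8
    for (int i = 0; i < 8; ++i) {
        d[i] = s[i];
    }
}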





[PATCH] linux-user: Add PowerPC ISA 3.1 and MMA to hwcap

2022-04-26 Thread Joel Stanley
These are new hwcap bits added for power10.

Signed-off-by: Joel Stanley 
---
MMA support for TCG is on the list so I think it makes sense for this to
land after those are merged.

I tested my patch with this program:

 https://github.com/shenki/p10_tests

$ qemu-ppc64le -cpu power10  -L ~/ppc64le/ ./test -c
HWCAP: 0x58000580 HWCAP2: 0x8ee60000
ISAv3.1: Yes
MMA: Yes

$ qemu-ppc64le -cpu power9  -L ~/ppc64le/ ./test -c
HWCAP: 0x58000580 HWCAP2: 0x8ee00000
ISAv3.1: No
MMA: No
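
For reference, a userspace check along these lines (a sketch, not the
linked test program itself; the PPC_FEATURE2_* values below match the
kernel UAPI):

#include <stdio.h>
#include <sys/auxv.h>

#ifndef PPC_FEATURE2_ARCH_3_1
#define PPC_FEATURE2_ARCH_3_1 0x00040000
#endif
#ifndef PPC_FEATURE2_MMA
#define PPC_FEATURE2_MMA 0x00020000
#endif

int main(void)
{
    unsigned long hwcap2 = getauxval(AT_HWCAP2);

    printf("ISAv3.1: %s\n", (hwcap2 & PPC_FEATURE2_ARCH_3_1) ? "Yes" : "No");
    printf("MMA: %s\n", (hwcap2 & PPC_FEATURE2_MMA) ? "Yes" : "No");
    return 0;
}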

 linux-user/elfload.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/linux-user/elfload.c b/linux-user/elfload.c
index 61063fd974e5..0908692e62b3 100644
--- a/linux-user/elfload.c
+++ b/linux-user/elfload.c
@@ -779,6 +779,8 @@ enum {
 QEMU_PPC_FEATURE2_DARN = 0x00200000, /* darn random number insn */
 QEMU_PPC_FEATURE2_SCV = 0x00100000, /* scv syscall */
 QEMU_PPC_FEATURE2_HTM_NO_SUSPEND = 0x00080000, /* TM w/o suspended state */
+QEMU_PPC_FEATURE2_ARCH_3_1 = 0x00040000, /* ISA 3.1 */
+QEMU_PPC_FEATURE2_MMA = 0x00020000, /* Matrix-Multiply Assist */
 };
 
 #define ELF_HWCAP get_elf_hwcap()
@@ -836,6 +838,8 @@ static uint32_t get_elf_hwcap2(void)
   QEMU_PPC_FEATURE2_VEC_CRYPTO);
 GET_FEATURE2(PPC2_ISA300, QEMU_PPC_FEATURE2_ARCH_3_00 |
  QEMU_PPC_FEATURE2_DARN | QEMU_PPC_FEATURE2_HAS_IEEE128);
+GET_FEATURE2(PPC2_ISA310, QEMU_PPC_FEATURE2_ARCH_3_1 |
+ QEMU_PPC_FEATURE2_MMA);
 
 #undef GET_FEATURE
 #undef GET_FEATURE2
-- 
2.35.1




Re: [RFC PATCH 0/7] VSX MMA Implementation

2022-04-26 Thread Joel Stanley
On Tue, 26 Apr 2022 at 12:51, Lucas Mateus Castro(alqotel)
 wrote:
>
> From: "Lucas Mateus Castro (alqotel)" 
>
> This patch series is an RFC of the Matrix-Multiply Assist (MMA)
> instructions implementation from the PowerISA 3.1
>
> These and the VDIV/VMOD implementation are the last new PowerISA 3.1
> instructions left to be implemented.
>
> Thanks
> Lucas Mateus Castro (alqotel) (7):
>   target/ppc: Implement xxm[tf]acc and xxsetaccz
>   target/ppc: Implemented xvi*ger* instructions
>   target/ppc: Implemented pmxvi*ger* instructions
>   target/ppc: Implemented xvf*ger*
>   target/ppc: Implemented xvf16ger*
>   target/ppc: Implemented pmxvf*ger*
>   target/ppc: Implemented [pm]xvbf16ger2*

I have a small test case for the MMA instructions that Alistair wrote
a while back[1]. It passes when run with these patches applied
(previously it would sigill).

$ qemu-ppc64le -cpu power10  -L ~/ppc64le/ ./test -m
Smoke test MMA
MMA[0] = 1 (Correct)
MMA[1] = 2 (Correct)
MMA[2] = 3 (Correct)
MMA[3] = 4 (Correct)
MMA[4] = 2 (Correct)
MMA[5] = 4 (Correct)
MMA[6] = 6 (Correct)
MMA[7] = 8 (Correct)
MMA[8] = 3 (Correct)
MMA[9] = 6 (Correct)
MMA[10] = 9 (Correct)
MMA[11] = 12 (Correct)
MMA[12] = 4 (Correct)
MMA[13] = 8 (Correct)
MMA[14] = 12 (Correct)
MMA[15] = 16 (Correct)

[1] https://github.com/shenki/p10_tests


>
>  include/fpu/softfloat.h |   9 ++
>  target/ppc/cpu.h|  15 +++
>  target/ppc/fpu_helper.c | 130 ++
>  target/ppc/helper.h |   7 +
>  target/ppc/insn32.decode|  49 +++
>  target/ppc/insn64.decode|  80 +++
>  target/ppc/int_helper.c |  85 
>  target/ppc/internal.h   |  28 
>  target/ppc/translate/vsx-impl.c.inc | 200 
>  9 files changed, 603 insertions(+)
>
> --
> 2.31.1
>
>



Re: [PATCH v22 0/8] support dirty restraint on vCPU

2022-04-26 Thread Markus Armbruster
Peter Xu  writes:

> Hi, Yong,
>
> On Mon, Apr 25, 2022 at 12:52:45AM +0800, Hyman wrote:
>> Ping.
>>  Hi, David and Peter, what do you think of this patchset?
>>  Is it suitable for queueing, or is there still something that needs to be done?
>
> It keeps looking good to me in general, let's see whether the maintainers
> have any comments.  Thanks,

Whose review is still needed?  Or is this ready?




Re: [PULL 0/6] Coverity patch queue

2022-04-26 Thread Richard Henderson

On 4/26/22 21:39, Richard Henderson wrote:

The following changes since commit a72d9008092e39c2c37e47a91bae4e170d0f1b33:

   Merge tag 'pull-nbd-2022-04-26' of https://repo.or.cz/qemu/ericb into 
staging (2022-04-26 14:39:09 -0700)

are available in the Git repository at:

   https://gitlab.com/rth7680/qemu.git tags/pull-tcg-20220426

for you to fetch changes up to dee3fcfbb399a0e4ccedbf737b5b0b7f56ecd398:

   softfloat: Use FloatRelation for fracN_cmp (2022-04-26 20:01:55 -0700)


Fix s390x ICMH cc computation.
Minor adjustments to satisfy Coverity.


Applied, thanks.  Please update https://wiki.qemu.org/ChangeLog/7.1 as 
appropriate.


r~






Richard Henderson (6):
   accel/tcg: Assert mmu_idx in range before use in cputlb
   target/s390x: Fix the accumulation of ccm in op_icm
   target/i386: Suppress coverity warning on fsave/frstor
   softfloat: Fix declaration of partsN_compare
   softfloat: Use FloatRelation within partsN_compare
   softfloat: Use FloatRelation for fracN_cmp

  accel/tcg/cputlb.c   | 40 +++-
  fpu/softfloat.c  | 20 +++-
  target/i386/tcg/fpu_helper.c |  4 ++--
  target/s390x/tcg/translate.c |  2 +-
  fpu/softfloat-parts.c.inc| 11 +++
  5 files changed, 48 insertions(+), 29 deletions(-)





[PATCH] target/arm: Use field names for accessing DBGWCRn

2022-04-26 Thread Richard Henderson
While defining these names, use the correct field width of 5 not 4 for
DBGWCR.MASK.  This typo prevented setting a watchpoint larger than 32k.

Reported-by: Chris Howard 
Signed-off-by: Richard Henderson 
---
 target/arm/internals.h| 12 
 target/arm/debug_helper.c | 10 +-
 target/arm/helper.c   |  8 
 target/arm/kvm64.c| 14 +++---
 4 files changed, 28 insertions(+), 16 deletions(-)

diff --git a/target/arm/internals.h b/target/arm/internals.h
index 9556e3b29e..255833479d 100644
--- a/target/arm/internals.h
+++ b/target/arm/internals.h
@@ -81,6 +81,18 @@ FIELD(V7M_EXCRET, RES1, 7, 25) /* including the must-be-1 
prefix */
  */
 #define FNC_RETURN_MIN_MAGIC 0xfefffffe
 
+/* Bit definitions for DBGWCRn and DBGWCRn_EL1 */
+FIELD(DBGWCR, E, 0, 1)
+FIELD(DBGWCR, PAC, 1, 2)
+FIELD(DBGWCR, LSC, 3, 2)
+FIELD(DBGWCR, BAS, 5, 8)
+FIELD(DBGWCR, HMC, 13, 1)
+FIELD(DBGWCR, SSC, 14, 2)
+FIELD(DBGWCR, LBN, 16, 4)
+FIELD(DBGWCR, WT, 20, 1)
+FIELD(DBGWCR, MASK, 24, 5)
+FIELD(DBGWCR, SSCE, 29, 1)
+
 /* We use a few fake FSR values for internal purposes in M profile.
  * M profile cores don't have A/R format FSRs, but currently our
  * get_phys_addr() code assumes A/R profile and reports failures via
diff --git a/target/arm/debug_helper.c b/target/arm/debug_helper.c
index 32f3caec23..46893697cc 100644
--- a/target/arm/debug_helper.c
+++ b/target/arm/debug_helper.c
@@ -143,9 +143,9 @@ static bool bp_wp_matches(ARMCPU *cpu, int n, bool is_wp)
  * Non-Secure to simplify the code slightly compared to the full
  * table in the ARM ARM.
  */
-pac = extract64(cr, 1, 2);
-hmc = extract64(cr, 13, 1);
-ssc = extract64(cr, 14, 2);
+pac = FIELD_EX64(cr, DBGWCR, PAC);
+hmc = FIELD_EX64(cr, DBGWCR, HMC);
+ssc = FIELD_EX64(cr, DBGWCR, SSC);
 
 switch (ssc) {
 case 0:
@@ -184,8 +184,8 @@ static bool bp_wp_matches(ARMCPU *cpu, int n, bool is_wp)
 g_assert_not_reached();
 }
 
-wt = extract64(cr, 20, 1);
-lbn = extract64(cr, 16, 4);
+wt = FIELD_EX64(cr, DBGWCR, WT);
+lbn = FIELD_EX64(cr, DBGWCR, LBN);
 
 if (wt && !linked_bp_matches(cpu, lbn)) {
 return false;
diff --git a/target/arm/helper.c b/target/arm/helper.c
index 63397bbac1..5a244c3ed9 100644
--- a/target/arm/helper.c
+++ b/target/arm/helper.c
@@ -6320,12 +6320,12 @@ void hw_watchpoint_update(ARMCPU *cpu, int n)
 env->cpu_watchpoint[n] = NULL;
 }
 
-if (!extract64(wcr, 0, 1)) {
+if (!FIELD_EX64(wcr, DBGWCR, E)) {
 /* E bit clear : watchpoint disabled */
 return;
 }
 
-switch (extract64(wcr, 3, 2)) {
+switch (FIELD_EX64(wcr, DBGWCR, LSC)) {
 case 0:
 /* LSC 00 is reserved and must behave as if the wp is disabled */
 return;
@@ -6344,7 +6344,7 @@ void hw_watchpoint_update(ARMCPU *cpu, int n)
  * CONSTRAINED UNPREDICTABLE; we opt to ignore BAS in this case,
  * thus generating a watchpoint for every byte in the masked region.
  */
-mask = extract64(wcr, 24, 4);
+mask = FIELD_EX64(wcr, DBGWCR, MASK);
 if (mask == 1 || mask == 2) {
 /* Reserved values of MASK; we must act as if the mask value was
  * some non-reserved value, or as if the watchpoint were disabled.
@@ -6361,7 +6361,7 @@ void hw_watchpoint_update(ARMCPU *cpu, int n)
 wvr &= ~(len - 1);
 } else {
 /* Watchpoint covers bytes defined by the byte address select bits */
-int bas = extract64(wcr, 5, 8);
+int bas = FIELD_EX64(wcr, DBGWCR, BAS);
 int basstart;
 
 if (extract64(wvr, 2, 1)) {
diff --git a/target/arm/kvm64.c b/target/arm/kvm64.c
index 17dd2f77d9..b8cfaf5782 100644
--- a/target/arm/kvm64.c
+++ b/target/arm/kvm64.c
@@ -208,7 +208,7 @@ static int insert_hw_watchpoint(target_ulong addr,
 target_ulong len, int type)
 {
 HWWatchpoint wp = {
-.wcr = 1, /* E=1, enable */
+.wcr = R_DBGWCR_E_MASK, /* E=1, enable */
 .wvr = addr & (~0x7ULL),
 .details = { .vaddr = addr, .len = len }
 };
@@ -221,19 +221,19 @@ static int insert_hw_watchpoint(target_ulong addr,
  * HMC=0 SSC=0 PAC=3 will hit EL0 or EL1, any security state,
  * valid whether EL3 is implemented or not
  */
-wp.wcr = deposit32(wp.wcr, 1, 2, 3);
+wp.wcr = FIELD_DP64(wp.wcr, DBGWCR, PAC, 3);
 
 switch (type) {
 case GDB_WATCHPOINT_READ:
-wp.wcr = deposit32(wp.wcr, 3, 2, 1);
+wp.wcr = FIELD_DP64(wp.wcr, DBGWCR, LSC, 1);
 wp.details.flags = BP_MEM_READ;
 break;
 case GDB_WATCHPOINT_WRITE:
-wp.wcr = deposit32(wp.wcr, 3, 2, 2);
+wp.wcr = FIELD_DP64(wp.wcr, DBGWCR, LSC, 2);
 wp.details.flags = BP_MEM_WRITE;
 break;
 case GDB_WATCHPOINT_ACCESS:
-wp.wcr = deposit32(wp.wcr, 3, 2, 3);
+wp.wcr = FIELD_DP64(wp.wcr, DBGWCR, LSC, 3);
 wp.details.flags = BP_MEM_ACCESS;
 break;
 default:
@@ -252

Re: Possible bug when setting aarch64 watchpoints

2022-04-26 Thread Richard Henderson

On 4/24/22 04:40, Chris Howard wrote:

Hi, I’m new to qemu (and even bug-reporting) so apologies in advance…

The MASK field in DBGWCRx_EL1 is **5** bits wide [28:24].

In target/arm/kvm64.c I found the line:

  wp.wcr = deposit32(wp.wcr, 24, 4, bits);  // ie **4** bits 
instead of **5**


If it’s not copying (or calculating?) the number of bits correctly this would 
explain the behaviour I’m seeing:

If I set

DBGWVR0_EL1 = 0x00800000

and

DBGWCR0_EL1 = 0x1F<<24 | 0xFF<<5 | 0b11<<3 | 0b11<<1 | 0b1<<0

and then access memory [0x00807FFF] I get a watchpoint exception. (ie. 
watchpoints ARE working/enabled)

But if I access [0x00808000] I *don’t* get an exception.

**If the MASK field gets set to 0b1111 instead of 0b11111 then only the bottom 
15 bits of the address get masked (instead of 31) and the masked address isn’t 
0x00800000, and the exception won’t be triggered.**
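
The truncation is easy to reproduce in isolation (standalone sketch;
deposit32 is reimplemented here just for illustration):

#include <stdint.h>
#include <stdio.h>

static uint32_t deposit32(uint32_t value, int start, int length,
                          uint32_t fieldval)
{
    uint32_t mask = (~0U >> (32 - length)) << start;
    return (value & ~mask) | ((fieldval << start) & mask);
}

int main(void)
{
    uint32_t wcr4 = deposit32(0, 24, 4, 0x1F); /* buggy: MASK becomes 15 */
    uint32_t wcr5 = deposit32(0, 24, 5, 0x1F); /* fixed: MASK is 31 */

    printf("4-bit deposit: MASK=%u (masks low 15 bits)\n", wcr4 >> 24);
    printf("5-bit deposit: MASK=%u (masks low 31 bits)\n", wcr5 >> 24);
    return 0;
}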


Unfortunately, changing the 4 to a 5 and recompiling had no effect :-(

I may well have misunderstood something. :-/


You're not.  It's a typo, repeated twice.  Will fix.


r~



[PULL 5/6] softfloat: Use FloatRelation within partsN_compare

2022-04-26 Thread Richard Henderson
As the return type is FloatRelation, it's clearer to
use the type for 'cmp' within the function.
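
For reference, FloatRelation (reproduced from
include/fpu/softfloat-types.h) encodes the result as a signed value,
which is what makes the 'cmp = -cmp' sign flip in the hunk below work:
negation swaps less and greater while leaving equal untouched.

typedef enum {
    float_relation_less      = -1,
    float_relation_equal     =  0,
    float_relation_greater   = +1,
    float_relation_unordered = +2
} FloatRelation;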

Signed-off-by: Richard Henderson 
Reviewed-by: Peter Maydell 
Message-Id: <20220401132240.79730-3-richard.hender...@linaro.org>
---
 fpu/softfloat-parts.c.inc | 11 +++
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/fpu/softfloat-parts.c.inc b/fpu/softfloat-parts.c.inc
index db3e1f393d..bbeadaa189 100644
--- a/fpu/softfloat-parts.c.inc
+++ b/fpu/softfloat-parts.c.inc
@@ -1327,16 +1327,19 @@ static FloatRelation partsN(compare)(FloatPartsN *a, FloatPartsN *b,
  float_status *s, bool is_quiet)
 {
 int ab_mask = float_cmask(a->cls) | float_cmask(b->cls);
-int cmp;
 
 if (likely(ab_mask == float_cmask_normal)) {
+FloatRelation cmp;
+
 if (a->sign != b->sign) {
 goto a_sign;
 }
-if (a->exp != b->exp) {
-cmp = a->exp < b->exp ? -1 : 1;
-} else {
+if (a->exp == b->exp) {
 cmp = frac_cmp(a, b);
+} else if (a->exp < b->exp) {
+cmp = float_relation_less;
+} else {
+cmp = float_relation_greater;
 }
 if (a->sign) {
 cmp = -cmp;
-- 
2.34.1




[PULL 1/6] accel/tcg: Assert mmu_idx in range before use in cputlb

2022-04-26 Thread Richard Henderson
Coverity reports out-of-bound accesses within cputlb.c.
This should be a false positive due to how the index is
decoded from MemOpIdx.  To be fair, nothing is checking
the correct bounds during encoding either.

Assert index in range before use, both to catch user errors
and to pacify static analysis.
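
For reference, the decode being asserted (a sketch of the relevant
helpers from include/exec/memopidx.h): the mmu index lives in the low
four bits, so any value in [0, 16) decodes without complaint even if
the target defines fewer than 16 MMU modes.

typedef uint32_t MemOpIdx;

static inline MemOpIdx make_memop_idx(MemOp op, unsigned idx)
{
    return (op << 4) | idx;   /* no bounds check at encode time */
}

static inline unsigned get_mmuidx(MemOpIdx oi)
{
    return oi & 15;           /* hence the new tcg_debug_assert() */
}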

Fixes: Coverity CID 1487120, 1487127, 1487170, 1487196, 1487215, 1487238
Signed-off-by: Richard Henderson 
Reviewed-by: Peter Maydell 
Reviewed-by: Alex Bennée 
Message-Id: <20220401170813.318609-1-richard.hender...@linaro.org>
---
 accel/tcg/cputlb.c | 40 +++-
 1 file changed, 27 insertions(+), 13 deletions(-)

diff --git a/accel/tcg/cputlb.c b/accel/tcg/cputlb.c
index dd45e0467b..f90f4312ea 100644
--- a/accel/tcg/cputlb.c
+++ b/accel/tcg/cputlb.c
@@ -1761,7 +1761,7 @@ static void *atomic_mmu_lookup(CPUArchState *env, target_ulong addr,
MemOpIdx oi, int size, int prot,
uintptr_t retaddr)
 {
-size_t mmu_idx = get_mmuidx(oi);
+uintptr_t mmu_idx = get_mmuidx(oi);
 MemOp mop = get_memop(oi);
 int a_bits = get_alignment_bits(mop);
 uintptr_t index;
@@ -1769,6 +1769,8 @@ static void *atomic_mmu_lookup(CPUArchState *env, target_ulong addr,
 target_ulong tlb_addr;
 void *hostaddr;
 
+tcg_debug_assert(mmu_idx < NB_MMU_MODES);
+
 /* Adjust the given return address.  */
 retaddr -= GETPC_ADJ;
 
@@ -1908,18 +1910,20 @@ load_helper(CPUArchState *env, target_ulong addr, MemOpIdx oi,
 uintptr_t retaddr, MemOp op, bool code_read,
 FullLoadHelper *full_load)
 {
-uintptr_t mmu_idx = get_mmuidx(oi);
-uintptr_t index = tlb_index(env, mmu_idx, addr);
-CPUTLBEntry *entry = tlb_entry(env, mmu_idx, addr);
-target_ulong tlb_addr = code_read ? entry->addr_code : entry->addr_read;
 const size_t tlb_off = code_read ?
 offsetof(CPUTLBEntry, addr_code) : offsetof(CPUTLBEntry, addr_read);
 const MMUAccessType access_type =
 code_read ? MMU_INST_FETCH : MMU_DATA_LOAD;
-unsigned a_bits = get_alignment_bits(get_memop(oi));
+const unsigned a_bits = get_alignment_bits(get_memop(oi));
+const size_t size = memop_size(op);
+uintptr_t mmu_idx = get_mmuidx(oi);
+uintptr_t index;
+CPUTLBEntry *entry;
+target_ulong tlb_addr;
 void *haddr;
 uint64_t res;
-size_t size = memop_size(op);
+
+tcg_debug_assert(mmu_idx < NB_MMU_MODES);
 
 /* Handle CPU specific unaligned behaviour */
 if (addr & ((1 << a_bits) - 1)) {
@@ -1927,6 +1931,10 @@ load_helper(CPUArchState *env, target_ulong addr, MemOpIdx oi,
  mmu_idx, retaddr);
 }
 
+index = tlb_index(env, mmu_idx, addr);
+entry = tlb_entry(env, mmu_idx, addr);
+tlb_addr = code_read ? entry->addr_code : entry->addr_read;
+
 /* If the TLB entry is for a different page, reload and try again.  */
 if (!tlb_hit(tlb_addr, addr)) {
 if (!victim_tlb_hit(env, mmu_idx, index, tlb_off,
@@ -2310,14 +2318,16 @@ static inline void QEMU_ALWAYS_INLINE
 store_helper(CPUArchState *env, target_ulong addr, uint64_t val,
  MemOpIdx oi, uintptr_t retaddr, MemOp op)
 {
-uintptr_t mmu_idx = get_mmuidx(oi);
-uintptr_t index = tlb_index(env, mmu_idx, addr);
-CPUTLBEntry *entry = tlb_entry(env, mmu_idx, addr);
-target_ulong tlb_addr = tlb_addr_write(entry);
 const size_t tlb_off = offsetof(CPUTLBEntry, addr_write);
-unsigned a_bits = get_alignment_bits(get_memop(oi));
+const unsigned a_bits = get_alignment_bits(get_memop(oi));
+const size_t size = memop_size(op);
+uintptr_t mmu_idx = get_mmuidx(oi);
+uintptr_t index;
+CPUTLBEntry *entry;
+target_ulong tlb_addr;
 void *haddr;
-size_t size = memop_size(op);
+
+tcg_debug_assert(mmu_idx < NB_MMU_MODES);
 
 /* Handle CPU specific unaligned behaviour */
 if (addr & ((1 << a_bits) - 1)) {
@@ -2325,6 +2335,10 @@ store_helper(CPUArchState *env, target_ulong addr, uint64_t val,
  mmu_idx, retaddr);
 }
 
+index = tlb_index(env, mmu_idx, addr);
+entry = tlb_entry(env, mmu_idx, addr);
+tlb_addr = tlb_addr_write(entry);
+
 /* If the TLB entry is for a different page, reload and try again.  */
 if (!tlb_hit(tlb_addr, addr)) {
 if (!victim_tlb_hit(env, mmu_idx, index, tlb_off,
-- 
2.34.1




[PULL 6/6] softfloat: Use FloatRelation for fracN_cmp

2022-04-26 Thread Richard Henderson
Since the caller, partsN_compare, is now exclusively
using FloatRelation, it's clearer to use it here too.

Signed-off-by: Richard Henderson 
Reviewed-by: Peter Maydell 
Message-Id: <20220401132240.79730-4-richard.hender...@linaro.org>
---
 fpu/softfloat.c | 12 +++-
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/fpu/softfloat.c b/fpu/softfloat.c
index e7d7ad56bc..4a871ef2a1 100644
--- a/fpu/softfloat.c
+++ b/fpu/softfloat.c
@@ -957,21 +957,23 @@ static void frac128_allones(FloatParts128 *a)
 
 #define frac_allones(A)  FRAC_GENERIC_64_128(allones, A)(A)
 
-static int frac64_cmp(FloatParts64 *a, FloatParts64 *b)
+static FloatRelation frac64_cmp(FloatParts64 *a, FloatParts64 *b)
 {
-return a->frac == b->frac ? 0 : a->frac < b->frac ? -1 : 1;
+return (a->frac == b->frac ? float_relation_equal
+: a->frac < b->frac ? float_relation_less
+: float_relation_greater);
 }
 
-static int frac128_cmp(FloatParts128 *a, FloatParts128 *b)
+static FloatRelation frac128_cmp(FloatParts128 *a, FloatParts128 *b)
 {
 uint64_t ta = a->frac_hi, tb = b->frac_hi;
 if (ta == tb) {
 ta = a->frac_lo, tb = b->frac_lo;
 if (ta == tb) {
-return 0;
+return float_relation_equal;
 }
 }
-return ta < tb ? -1 : 1;
+return ta < tb ? float_relation_less : float_relation_greater;
 }
 
 #define frac_cmp(A, B)  FRAC_GENERIC_64_128(cmp, A)(A, B)
-- 
2.34.1




[PULL 4/6] softfloat: Fix declaration of partsN_compare

2022-04-26 Thread Richard Henderson
The declaration used 'int', while the definition used 'FloatRelation'.
This should have resulted in a compiler error, but mysteriously didn't.

Signed-off-by: Richard Henderson 
Reviewed-by: Peter Maydell 
Message-Id: <20220401132240.79730-2-richard.hender...@linaro.org>
---
 fpu/softfloat.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/fpu/softfloat.c b/fpu/softfloat.c
index 5e2cf20448..e7d7ad56bc 100644
--- a/fpu/softfloat.c
+++ b/fpu/softfloat.c
@@ -874,10 +874,10 @@ static FloatParts128 *parts128_minmax(FloatParts128 *a, FloatParts128 *b,
 #define parts_minmax(A, B, S, F) \
 PARTS_GENERIC_64_128(minmax, A)(A, B, S, F)
 
-static int parts64_compare(FloatParts64 *a, FloatParts64 *b,
-   float_status *s, bool q);
-static int parts128_compare(FloatParts128 *a, FloatParts128 *b,
-float_status *s, bool q);
+static FloatRelation parts64_compare(FloatParts64 *a, FloatParts64 *b,
+ float_status *s, bool q);
+static FloatRelation parts128_compare(FloatParts128 *a, FloatParts128 *b,
+  float_status *s, bool q);
 
 #define parts_compare(A, B, S, Q) \
 PARTS_GENERIC_64_128(compare, A)(A, B, S, Q)
-- 
2.34.1




[PULL 3/6] target/i386: Suppress coverity warning on fsave/frstor

2022-04-26 Thread Richard Henderson
Coverity warns that 14 << data32 may overflow with respect
to the target_ulong to which it is subsequently added.
We know this wasn't true because data32 is in [1,2],
but the suggested fix is perfectly fine.

Fixes: Coverity CID 1487135, 1487256
Signed-off-by: Richard Henderson 
Reviewed-by: Damien Hedde 
Message-Id: <20220401184635.327423-1-richard.hender...@linaro.org>
---
 target/i386/tcg/fpu_helper.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/target/i386/tcg/fpu_helper.c b/target/i386/tcg/fpu_helper.c
index ebf5e73df9..30bc44fcf8 100644
--- a/target/i386/tcg/fpu_helper.c
+++ b/target/i386/tcg/fpu_helper.c
@@ -2466,7 +2466,7 @@ static void do_fsave(CPUX86State *env, target_ulong ptr, int data32,
 
 do_fstenv(env, ptr, data32, retaddr);
 
-ptr += (14 << data32);
+ptr += (target_ulong)14 << data32;
 for (i = 0; i < 8; i++) {
 tmp = ST(i);
 do_fstt(env, tmp, ptr, retaddr);
@@ -2488,7 +2488,7 @@ static void do_frstor(CPUX86State *env, target_ulong ptr, int data32,
 int i;
 
 do_fldenv(env, ptr, data32, retaddr);
-ptr += (14 << data32);
+ptr += (target_ulong)14 << data32;
 
 for (i = 0; i < 8; i++) {
 tmp = do_fldt(env, ptr, retaddr);
-- 
2.34.1




[PULL 2/6] target/s390x: Fix the accumulation of ccm in op_icm

2022-04-26 Thread Richard Henderson
Coverity rightly reports that 0xff << pos can overflow.
This would affect the ICMH instruction.
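
Concretely (a standalone sketch of the C arithmetic, not QEMU code):

/* 0xff has type int.  For ICMH the insert position can be as high as
 * 56, so the unsuffixed shift is undefined behaviour: */
uint64_t ccm_good = 0xffull << 56;  /* 0xff00000000000000, as intended */
/* uint64_t ccm_bad = 0xff << 56;      UB: shift count >= width of int */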

Fixes: Coverity CID 1487161
Signed-off-by: Richard Henderson 
Reviewed-by: David Hildenbrand 
Reviewed-by: Thomas Huth 
Message-Id: <20220401193659.332079-1-richard.hender...@linaro.org>
---
 target/s390x/tcg/translate.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/target/s390x/tcg/translate.c b/target/s390x/tcg/translate.c
index ae70368966..8f092dab95 100644
--- a/target/s390x/tcg/translate.c
+++ b/target/s390x/tcg/translate.c
@@ -2622,7 +2622,7 @@ static DisasJumpType op_icm(DisasContext *s, DisasOps *o)
 tcg_gen_qemu_ld8u(tmp, o->in2, get_mem_index(s));
 tcg_gen_addi_i64(o->in2, o->in2, 1);
 tcg_gen_deposit_i64(o->out, o->out, tmp, pos, 8);
-ccm |= 0xff << pos;
+ccm |= 0xffull << pos;
 }
 m3 = (m3 << 1) & 0xf;
 pos -= 8;
-- 
2.34.1




[PULL 0/6] Coverity patch queue

2022-04-26 Thread Richard Henderson
The following changes since commit a72d9008092e39c2c37e47a91bae4e170d0f1b33:

  Merge tag 'pull-nbd-2022-04-26' of https://repo.or.cz/qemu/ericb into staging 
(2022-04-26 14:39:09 -0700)

are available in the Git repository at:

  https://gitlab.com/rth7680/qemu.git tags/pull-tcg-20220426

for you to fetch changes up to dee3fcfbb399a0e4ccedbf737b5b0b7f56ecd398:

  softfloat: Use FloatRelation for fracN_cmp (2022-04-26 20:01:55 -0700)


Fix s390x ICMH cc computation.
Minor adjustments to satisfy Coverity.


Richard Henderson (6):
  accel/tcg: Assert mmu_idx in range before use in cputlb
  target/s390x: Fix the accumulation of ccm in op_icm
  target/i386: Suppress coverity warning on fsave/frstor
  softfloat: Fix declaration of partsN_compare
  softfloat: Use FloatRelation within partsN_compare
  softfloat: Use FloatRelation for fracN_cmp

 accel/tcg/cputlb.c   | 40 +++-
 fpu/softfloat.c  | 20 +++-
 target/i386/tcg/fpu_helper.c |  4 ++--
 target/s390x/tcg/translate.c |  2 +-
 fpu/softfloat-parts.c.inc| 11 +++
 5 files changed, 48 insertions(+), 29 deletions(-)



[PATCH qemu] spapr_pci: Disable IRQFD resampling on XIVE

2022-04-26 Thread Alexey Kardashevskiy
VFIO-PCI has a "KVM_IRQFD_FLAG_RESAMPLE" optimization for INTx EOI
handling when KVM can unmask PCI INTx (level triggered interrupt) without
switching to the userspace (==QEMU).

Unfortunately XIVE does not support level interrupts, QEMU emulates them
and therefore there is no existing code path to kick the resamplefd.
The problem appears when passing through a PCI adapter with
the "pci=nomsi" kernel parameter - the adapter's interrupt
count in /proc/interrupts will be stuck at "1".

This disables the resampler when the XIVE interrupt controller is configured.
This should not be very visible though: KVM already exits to QEMU for INTx
anyway, and XIVE-capable boxes (POWER9 and newer) do not seem to have
performance-critical INTx-only capable devices.

Signed-off-by: Alexey Kardashevskiy 
---


Cédric, this is what I meant when I said that spapr_pci.c was unaware of
the interrupt controller type; neither xics nor xive was mentioned
in the file before.
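
For context, the resample optimization being disabled here is wired up
at the KVM level roughly as follows (sketch against the kernel UAPI in
<linux/kvm.h>, not the QEMU wrapper code; error handling omitted):

#include <linux/kvm.h>
#include <sys/ioctl.h>

static int assign_resample_irqfd(int vmfd, int eventfd, int resamplefd,
                                 int gsi)
{
    struct kvm_irqfd irqfd = {
        .fd = eventfd,            /* kernel injects 'gsi' when this fires */
        .gsi = gsi,
        .flags = KVM_IRQFD_FLAG_RESAMPLE,
        .resamplefd = resamplefd, /* kicked on guest EOI so VFIO can
                                   * unmask the level-triggered INTx */
    };

    return ioctl(vmfd, KVM_IRQFD, &irqfd);
}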


---
 hw/ppc/spapr_pci.c | 14 +++---
 1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
index 5bfd4aa9e5aa..2675052601db 100644
--- a/hw/ppc/spapr_pci.c
+++ b/hw/ppc/spapr_pci.c
@@ -729,11 +729,19 @@ static void pci_spapr_set_irq(void *opaque, int irq_num, int level)
 
 static PCIINTxRoute spapr_route_intx_pin_to_irq(void *opaque, int pin)
 {
+SpaprMachineState *spapr = SPAPR_MACHINE(qdev_get_machine());
 SpaprPhbState *sphb = SPAPR_PCI_HOST_BRIDGE(opaque);
-PCIINTxRoute route;
+PCIINTxRoute route = { .mode = PCI_INTX_DISABLED };
 
-route.mode = PCI_INTX_ENABLED;
-route.irq = sphb->lsi_table[pin].irq;
+/*
+ * Disable IRQFD resampler on XIVE as it does not support LSI and QEMU
+ * emulates those so the KVM kernel resamplefd kick is skipped and EOI
+ * is not delivered to VFIO-PCI.
+ */
+if (!spapr->xive) {
+route.mode = PCI_INTX_ENABLED;
+route.irq = sphb->lsi_table[pin].irq;
+}
 
 return route;
 }
-- 
2.30.2




Re: [PATCH 0/7] vhost-vdpa multiqueue fixes

2022-04-26 Thread Jason Wang



On 2022/3/30 14:33, Si-Wei Liu wrote:

Hi,

This patch series attempts to fix a few issues in vhost-vdpa multiqueue 
functionality.

Patch #1 is the formal submission for RFC patch in:
https://lore.kernel.org/qemu-devel/c3e931ee-1a1b-9c2f-2f59-cb4395c23...@oracle.com/

Patch #2 and #3 were taken from a previous patchset posted on qemu-devel:
https://lore.kernel.org/qemu-devel/2027192851.65529-1-epere...@redhat.com/

albeit abandoned, two patches in that set turn out to be useful for patch #4, 
which fixes a QEMU crash due to a race condition.

Patches #5 through #7 are obviously small bug fixes. Please find the description 
of each in the commit log.

Thanks,
-Siwei



Hi Si Wei:

I think we need another version of this series?

Thanks




---

Eugenio Pérez (2):
   virtio-net: Fix indentation
   virtio-net: Only enable userland vq if using tap backend

Si-Wei Liu (5):
   virtio-net: align ctrl_vq index for non-mq guest for vhost_vdpa
   virtio: don't read pending event on host notifier if disabled
   vhost-vdpa: fix improper cleanup in net_init_vhost_vdpa
   vhost-net: fix improper cleanup in vhost_net_start
   vhost-vdpa: backend feature should set only once

  hw/net/vhost_net.c |  4 +++-
  hw/net/virtio-net.c| 25 +
  hw/virtio/vhost-vdpa.c |  2 +-
  hw/virtio/virtio-bus.c |  3 ++-
  hw/virtio/virtio.c | 21 +
  include/hw/virtio/virtio.h |  2 ++
  net/vhost-vdpa.c   |  4 +++-
  7 files changed, 45 insertions(+), 16 deletions(-)






[PATCH] target/arm: Enable SCTLR_EL1.BT0 for aarch64-linux-user

2022-04-26 Thread Richard Henderson
This controls whether the PACI{A,B}SP instructions trap with BTYPE=3
(indirect branch from register other than x16/x17).  The linux kernel
sets this in bti_enable().

Resolves: https://gitlab.com/qemu-project/qemu/-/issues/998
Signed-off-by: Richard Henderson 
---
 target/arm/cpu.c  |  2 ++
 tests/tcg/aarch64/bti-3.c | 42 +++
 tests/tcg/aarch64/Makefile.target |  8 +++---
 3 files changed, 48 insertions(+), 4 deletions(-)
 create mode 100644 tests/tcg/aarch64/bti-3.c

diff --git a/target/arm/cpu.c b/target/arm/cpu.c
index e3f8215203..c50a8dca0b 100644
--- a/target/arm/cpu.c
+++ b/target/arm/cpu.c
@@ -197,6 +197,8 @@ static void arm_cpu_reset(DeviceState *dev)
 /* Enable all PAC keys.  */
 env->cp15.sctlr_el[1] |= (SCTLR_EnIA | SCTLR_EnIB |
   SCTLR_EnDA | SCTLR_EnDB);
+/* Trap on btype=3 for PACIxSP. */
+env->cp15.sctlr_el[1] |= SCTLR_BT0;
 /* and to the FP/Neon instructions */
 env->cp15.cpacr_el1 = deposit64(env->cp15.cpacr_el1, 20, 2, 3);
 /* and to the SVE instructions */
diff --git a/tests/tcg/aarch64/bti-3.c b/tests/tcg/aarch64/bti-3.c
new file mode 100644
index 00..a852856d9a
--- /dev/null
+++ b/tests/tcg/aarch64/bti-3.c
@@ -0,0 +1,42 @@
+/*
+ * BTI vs PACIASP
+ */
+
+#include "bti-crt.inc.c"
+
+static void skip2_sigill(int sig, siginfo_t *info, ucontext_t *uc)
+{
+uc->uc_mcontext.pc += 8;
+uc->uc_mcontext.pstate = 1;
+}
+
+#define BTYPE_1() \
+asm("mov %0,#1; adr x16, 1f; br x16; 1: hint #25; mov %0,#0" \
+: "=r"(skipped) : : "x16", "x30")
+
+#define BTYPE_2() \
+asm("mov %0,#1; adr x16, 1f; blr x16; 1: hint #25; mov %0,#0" \
+: "=r"(skipped) : : "x16", "x30")
+
+#define BTYPE_3() \
+asm("mov %0,#1; adr x15, 1f; br x15; 1: hint #25; mov %0,#0" \
+: "=r"(skipped) : : "x15", "x30")
+
+#define TEST(WHICH, EXPECT) \
+do { WHICH(); fail += skipped ^ EXPECT; } while (0)
+
+int main()
+{
+int fail = 0;
+int skipped;
+
+/* Signal-like with SA_SIGINFO.  */
+signal_info(SIGILL, skip2_sigill);
+
+/* With SCTLR_EL1.BT0 set, PACIASP is not compatible with type=3. */
+TEST(BTYPE_1, 0);
+TEST(BTYPE_2, 0);
+TEST(BTYPE_3, 1);
+
+return fail;
+}
diff --git a/tests/tcg/aarch64/Makefile.target 
b/tests/tcg/aarch64/Makefile.target
index 6ad0ad49f9..a738eb137c 100644
--- a/tests/tcg/aarch64/Makefile.target
+++ b/tests/tcg/aarch64/Makefile.target
@@ -26,11 +26,11 @@ run-plugin-pauth-%: QEMU_OPTS += -cpu max
 endif
 
 # BTI Tests
-# bti-1 tests the elf notes, so we require special compiler support.
+# bti-1 test the elf notes, so we require special compiler support.
 ifneq ($(CROSS_CC_HAS_ARMV8_BTI),)
-AARCH64_TESTS += bti-1
-bti-1: CFLAGS += -mbranch-protection=standard
-bti-1: LDFLAGS += -nostdlib
+AARCH64_TESTS += bti-1 bti-3
+bti-1 bti-3: CFLAGS += -mbranch-protection=standard
+bti-1 bti-3: LDFLAGS += -nostdlib
 endif
 # bti-2 tests PROT_BTI, so no special compiler support required.
 AARCH64_TESTS += bti-2
-- 
2.34.1




Re: qemu questions about x86

2022-04-26 Thread Wei Li
Dear 项晨东

On Sat, Apr 23, 2022 at 3:57 PM 项晨东  wrote:

> Dear qemu developers:
> hello~ I'm Xiang Chen dong, a student from Tsinghua University. recently I
> am trying to  accomplish new X86 feature named user-interrupts which can
> view here
> 
> .
> I worked for a couple of time, reaching status that new msrs added and
> access of msrs is work well, also add new CPUID infos to qemu64, also I
> could catch new instructions by modify `translate.c` file. my code could
> find here , the correspond linux kernel
> version could find here .
> but now I have some problems when trying to accomplish instructions named
> SENDUIPI and UIRET.
> for SENDUIPI, the main function of it is sending the user-interrupts. the
> detail way is, machine access memory(address saved in new msr), then read
> another address from memory, then write some content to this memory. I read
> the qemu source code, and found a lot of functions like tcg_gen_qemu_ld, but
> when I click into it from my IDE (vscode), I could not find where the function
> body is (maybe due to the macro). So I don't understand how the function works
> and how I can write a new function to access guest machine memory and write
> back in qemu.
>

tcg_frontend: gen_op_ld_v-->tcg_gen_qemu_ld_tl-->tcg_gen_qemu_ld_i64
(tcg/tcg-op.c)-->gen_ldst_i64
tcg_backend: case INDEX_op_qemu_ld_i64:-->tcg_out_qemu_ld
(tcg-target.c.inc tcg/i386)
You only need to focus on the frontend and learn from how to translate
other instructions.
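
For guest memory accesses at run time, a helper can also use the
cpu_ld*/cpu_st* functions from "exec/cpu_ldst.h" inside QEMU's target
code. A sketch (the UPID layout and helper name here are made up, not
the real SENDUIPI semantics):

/* Sketch: read a 64-bit word from guest memory and store it back with
 * a (hypothetical) notification bit set. */
void helper_senduipi_sketch(CPUX86State *env, target_ulong upid_addr)
{
    uintptr_t ra = GETPC();
    uint64_t w = cpu_ldq_data_ra(env, upid_addr, ra);

    cpu_stq_data_ra(env, upid_addr, w | 1, ra);
}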

another problem is that I don't quite get the idea of the implementation of
> interrupts. I could find functions like raise_interrupt and raise_exception,
> but I don't understand how they interact with the apic (how the control flow
> is switched to other functions; I found cpu_loop_exit_restore, but cannot find
> the function body), nor how the interrupt is handled.
>

hardware interrupt produce
pc_i8259_create-->i8259_init-->x86_allocate_cpu_irq-->pic_irq_request
pic_irq_request-->cpu_interrupt(cs, CPU_INTERRUPT_HARD)
-->softmmu/cpus.c/cpu_interrupt-->tcg_handle_interrupt
  -->cpu_reset_interrupt-->hw/core/cpu-common.c/cpu_reset_interrupt

 hardware interrupt handle
cpu_exec-->cpu_handle_interrupt-->cc->tcg_ops->cpu_exec_interrupt-->x86_cpu_exec_interrupt
-->cpu_get_pic_interrupt-->pic_read_irq
-->do_interrupt_x86_hardirq-->do_interrupt_all-->do_interrupt_protected-->
use siglongjmp or sigsetjmp

exception handle
cpu_handle_exception-->cc->tcg_ops->fake_user_interrupt-->x86_cpu_do_interrupt-->do_interrupt_all


>
>
the problem is difficult in some ways; I discussed it with my classmates and
> friends, but there was no answer.
> So I'm hoping to get important information from you. Is my way of reading
> code right? Are there any tools for development (finding the function
> body)? How can I accomplish this quickly?
> thank you very very much!
> best wishes!
> Xiang Chen Dong
>

Everything here may have some mistakes.
Hope it is useful for you.
-- 
best wishes!

Wei Li


Re: [PATCH qemu 1/9] target/riscv: rvv: Add mask agnostic for vv instructions

2022-04-26 Thread eop Chen


> Weiwei Li wrote on 2022-04-27 at 11:27 AM:
> 
> 
> 
> On 2022/4/27 10:07 AM, eop Chen wrote:
>> 
>> 
>>> 
>>> On 2022/4/27 2:20 AM, eop Chen wrote:
 
> Weiwei Li <liwei...@iscas.ac.cn> wrote on 2022-04-26 at 4:47 PM:
> On 2022/3/17 3:26 PM, ~eopxd wrote:
>> From: Yueh-Ting (eop) Chen  
>> 
>> 
>> This is the first commit regarding the mask agnostic behavior.
>> Added option 'rvv_ma_all_1s' to enable the behavior, the option
>> is default to false.
>> 
>> Signed-off-by: eop Chen  
>> 
>> Reviewed-by: Frank Chang  
>> 
> Similar to our last discussion, the vext_set_elems_1s_fns array can be 
> simplified to a single vext_set_elems_1s,
> 
> since the fourth argument can be used as the start offset. 
> 
> Another question, maybe not related to this patchset: in section 3.4.3 
> of the spec: 
> 
> "Mask destination tail elements are always treated as tail-agnostic, 
> regardless of the setting of vta."
> 
> What does "Mask destination tail elements" mean?
> 
> Regards,
> 
> Weiwei Li
> 
 
 
 I have just updated a new version for the tail agnostic patch set, it also 
 includes a bug fix discovered today.
 
 Regarding the question, mask / masked-off are for body elements and 
 respect the mask policy, and tail elements respect the tail policy.
 
 Regards,
 
 eop Chen
>>> 
>>> I found other descriptions in the spec. For the instructions that write 
>>> a mask register (such as vmadc, vmseq, vmsne, vmfeq...), they all have similar 
>>> descriptions
>>> 
>>> (You can search "tail-agnostic" in the spec):
>>> 
>>> Section 11.4: "Because these instructions produce a mask value, they always 
>>> operate with a tail-agnostic policy"
>>> 
>>> Section 11.8/13.13: "Compares write mask registers, and so always operate 
>>> under a tail-agnostic policy"
>>> 
>>> Section 15.1: "Mask elements past vl, the tail elements, are always updated 
>>> with a tail-agnostic policy"
>>> 
>>> Section 15.4/15.5/15.6: "The tail elements in the destination mask register 
>>> are updated under a tail-agnostic policy"
>>> 
>>> So I think "Mask destination tail elements" may mean the tail elements for 
>>> instructions that take a mask register as the destination register.  For these 
>>> instructions, 
>>> 
>>> maybe the setting of VTA can be ignored.  
>>> 
>>> Regards,
>>> 
>>> Weiwei Li
>>> 
>> 
>> Yes, the setting of VTA should be ignored where the v-spec specifies so.
>> I think the implementation of the tail agnostic patch set doesn't need to 
>> change for this.
> Sorry, I don't quite get your idea. 
> 
> In the current patch, these instructions seem to need to set the tail elements to 
> 1s when vta is true, which means
> 
> VTA is set and rvv_ta_all_1s is enabled. If the setting of VTA should be 
> ignored for these instructions,  
> 
> they should set the tail elements to 1s whenever rvv_ta_all_1s is enabled.
> 
> Regards,
> 
> Weiwei Li
> 
>> 
>> Regards,
>> 
>> eop Chen
>> 
>> 
>> 

I see your point now. Yes the implementation should be changed.
Let me submit a new version for this.

Regards,

eop Chen




Re: [PATCH qemu 1/9] target/riscv: rvv: Add mask agnostic for vv instructions

2022-04-26 Thread Weiwei Li


On 2022/4/27 10:07 AM, eop Chen wrote:





On 2022/4/27 2:20 AM, eop Chen wrote:


Weiwei Li <liwei...@iscas.ac.cn> wrote on 2022-04-26 at 4:47 PM:

On 2022/3/17 3:26 PM, ~eopxd wrote:

From: Yueh-Ting (eop) Chen

This is the first commit regarding the mask agnostic behavior.
Added option 'rvv_ma_all_1s' to enable the behavior, the option
is default to false.

Signed-off-by: eop Chen
Reviewed-by: Frank Chang


Similar to our last discussion, the vext_set_elems_1s_fns array can be 
simplified to a single vext_set_elems_1s,

since the fourth argument can be used as the start offset.

Another question, maybe not related to this patchset: in section 
3.4.3 of the spec:

"Mask destination tail elements are always treated as 
tail-agnostic, regardless of the setting of vta."


What does "Mask destination tail elements" mean?

Regards,

Weiwei Li



I have just updated a new version of the tail agnostic patch set; 
it also includes a bug fix discovered today.

Regarding the question, mask / masked-off are for body elements and 
respect the mask policy, and tail elements respect the tail policy.


Regards,

eop Chen


I found other descriptions in the spec. For the instructions that 
write a mask register (such as vmadc, vmseq, vmsne, vmfeq...), they all 
have similar descriptions

(You can search "tail-agnostic" in the spec):

Section 11.4: "Because these instructions produce a mask value, they 
always operate with a tail-agnostic policy"

Section 11.8/13.13: "Compares write mask registers, and so always 
operate under a tail-agnostic policy"

Section 15.1: "Mask elements past vl, the tail elements, are always 
updated with a tail-agnostic policy"

Section 15.4/15.5/15.6: "The tail elements in the destination mask 
register are updated under a tail-agnostic policy"

So I think "Mask destination tail elements" may mean the tail 
elements for instructions that take a mask register as the destination 
register.  For these instructions,

maybe the setting of VTA can be ignored.

Regards,

Weiwei Li



Yes, the setting of VTA should be ignored where the v-spec specifies so.
I think the implementation of the tail agnostic patch set doesn't need 
to change for this.


Sorry, I don't quite get your idea.

In the current patch, these instructions seem to need to set the tail elements 
to 1s when vta is true, which means

VTA is set and rvv_ta_all_1s is enabled. If the setting of VTA should be 
ignored for these instructions,

they should set the tail elements to 1s whenever rvv_ta_all_1s is enabled.

Regards,

Weiwei Li



Regards,

eop Chen





Re: [PATCH v4 4/6] vduse-blk: implements vduse-blk export

2022-04-26 Thread Yongji Xie
On Wed, Apr 27, 2022 at 1:03 AM Kevin Wolf  wrote:
>
> Am 06.04.2022 um 09:59 hat Xie Yongji geschrieben:
> > This implements a VDUSE block backends based on
> > the libvduse library. We can use it to export the BDSs
> > for both VM and container (host) usage.
> >
> > The new command-line syntax is:
> >
> > $ qemu-storage-daemon \
> > --blockdev file,node-name=drive0,filename=test.img \
> > --export vduse-blk,node-name=drive0,id=vduse-export0,writable=on
> >
> > After the qemu-storage-daemon started, we need to use
> > the "vdpa" command to attach the device to vDPA bus:
> >
> > $ vdpa dev add name vduse-export0 mgmtdev vduse
> >
> > Also the device must be removed via the "vdpa" command
> > before we stop the qemu-storage-daemon.
> >
> > Signed-off-by: Xie Yongji 
>
> The request handling code is almost the same as for the vhost-user-blk 
> export. I wonder if we could share this code instead of copying.
>

I think we can. Will do it in v5.

> The main difference seems to be that you chose not to support discard
> and write_zeroes yet. I'm curious if there is a reason why the
> vhost-user-blk code wouldn't work for vdpa there?
>

They are different protocols. The data plane is similar, so we can
share some code. But the control plane is different, e.g., vhost-user
can only work for guests but vdpa can work for both guests and hosts.

> > +features = vduse_get_virtio_features() |
> > +   (1ULL << VIRTIO_BLK_F_SIZE_MAX) |
> > +   (1ULL << VIRTIO_BLK_F_SEG_MAX) |
> > +   (1ULL << VIRTIO_BLK_F_TOPOLOGY) |
> > +   (1ULL << VIRTIO_BLK_F_BLK_SIZE);
> > +
> > +if (num_queues > 1) {
> > +features |= 1ULL << VIRTIO_BLK_F_MQ;
> > +}
> > +if (!vblk_exp->writable) {
> > +features |= 1ULL << VIRTIO_BLK_F_RO;
> > +}
>
> VIRTIO_BLK_F_FLUSH seems to be missing even though the flush command is
> implemented.
>

Oops. Will fix it.

> (This is not a full review yet, just two or three things I noticed while
> having a quick look.)
>

Thank you for your time!

Thanks,
Yongji



Re: [PATCH 0/3] softfloat: FloatRelation cleanups

2022-04-26 Thread Richard Henderson

On 4/1/22 06:22, Richard Henderson wrote:

Make consistent use of FloatRelation throughout the
implementation of the float compare functions.

I don't yet know if this actually solves Coverity issues,
but they all look a bit cleaner, I think.

r~

Richard Henderson (3):
   softfloat: Fix declaration of partsN_compare
   softfloat: Use FloatRelation within partsN_compare
   softfloat: Use FloatRelation for fracN_cmp

  fpu/softfloat.c   | 20 +++-
  fpu/softfloat-parts.c.inc | 11 +++
  2 files changed, 18 insertions(+), 13 deletions(-)



Queuing to tcg-next.


r~



Re: [PATCH] accel/tcg: Assert mmu_idx in range before use in cputlb

2022-04-26 Thread Richard Henderson

On 4/1/22 10:08, Richard Henderson wrote:

Coverity reports out-of-bound accesses within cputlb.c.
This should be a false positive due to how the index is
decoded from MemOpIdx.  To be fair, nothing is checking
the correct bounds during encoding either.

Assert index in range before use, both to catch user errors
and to pacify static analysis.

Fixes: Coverity CID 1487120, 1487127, 1487170, 1487196, 1487215, 1487238
Signed-off-by: Richard Henderson 


Queuing to tcg-next.

r~



Re: [PATCH] target/i386: Suppress coverity warning on fsave/frstor

2022-04-26 Thread Richard Henderson

On 4/1/22 11:46, Richard Henderson wrote:

Coverity warns that 14 << data32 may overflow with respect
to the target_ulong to which it is subsequently added.
We know this wasn't true because data32 is in [1,2],
but the suggested fix is perfectly fine.

Fixes: Coverity CID 1487135, 1487256
Signed-off-by: Richard Henderson 
---
  target/i386/tcg/fpu_helper.c | 4 ++--
  1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/target/i386/tcg/fpu_helper.c b/target/i386/tcg/fpu_helper.c
index ebf5e73df9..30bc44fcf8 100644
--- a/target/i386/tcg/fpu_helper.c
+++ b/target/i386/tcg/fpu_helper.c
@@ -2466,7 +2466,7 @@ static void do_fsave(CPUX86State *env, target_ulong ptr, int data32,
  
  do_fstenv(env, ptr, data32, retaddr);
  
-ptr += (14 << data32);

+ptr += (target_ulong)14 << data32;
  for (i = 0; i < 8; i++) {
  tmp = ST(i);
  do_fstt(env, tmp, ptr, retaddr);
@@ -2488,7 +2488,7 @@ static void do_frstor(CPUX86State *env, target_ulong ptr, int data32,
  int i;
  
  do_fldenv(env, ptr, data32, retaddr);

-ptr += (14 << data32);
+ptr += (target_ulong)14 << data32;
  
  for (i = 0; i < 8; i++) {

  tmp = do_fldt(env, ptr, retaddr);


Queuing to tcg-next.


r~



Re: [PATCH] target/s390x: Fix the accumulation of ccm in op_icm

2022-04-26 Thread Richard Henderson

On 4/1/22 12:36, Richard Henderson wrote:

Coverity rightly reports that 0xff << pos can overflow.
This would affect the ICMH instruction.

Fixes: Coverity CID 1487161
Signed-off-by: Richard Henderson 
---
  target/s390x/tcg/translate.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/target/s390x/tcg/translate.c b/target/s390x/tcg/translate.c
index 5acfc0ff9b..ea7baf0832 100644
--- a/target/s390x/tcg/translate.c
+++ b/target/s390x/tcg/translate.c
@@ -2622,7 +2622,7 @@ static DisasJumpType op_icm(DisasContext *s, DisasOps *o)
  tcg_gen_qemu_ld8u(tmp, o->in2, get_mem_index(s));
  tcg_gen_addi_i64(o->in2, o->in2, 1);
  tcg_gen_deposit_i64(o->out, o->out, tmp, pos, 8);
-ccm |= 0xff << pos;
+ccm |= 0xffull << pos;
  }
  m3 = (m3 << 1) & 0xf;
  pos -= 8;


Queuing to tcg-next.


r~



[PATCH v2] linux-user: Clean up arg_start/arg_end confusion

2022-04-26 Thread Richard Henderson
We had two sets of variables: arg_start/arg_end, and
arg_strings/env_strings.  In linuxload.c, we set the
first pair to the bounds of the argv strings, but in
elfload.c, we set the first pair to the bounds of the
argv pointers and the second pair to the bounds of
the argv strings.

Remove arg_start/arg_end, replacing them with the standard
argc/argv/envc/envp values.  Retain arg_strings/env_strings
with the meaning we were using in elfload.c.

Resolves: https://gitlab.com/qemu-project/qemu/-/issues/714
Reviewed-by: Peter Maydell 
Signed-off-by: Richard Henderson 
---
 linux-user/qemu.h | 12 
 linux-user/elfload.c  | 10 ++
 linux-user/linuxload.c| 12 ++--
 linux-user/main.c |  4 ++--
 semihosting/arm-compat-semi.c |  4 ++--
 5 files changed, 28 insertions(+), 14 deletions(-)

diff --git a/linux-user/qemu.h b/linux-user/qemu.h
index 46550f5e21..7d90de1b15 100644
--- a/linux-user/qemu.h
+++ b/linux-user/qemu.h
@@ -40,15 +40,19 @@ struct image_info {
 abi_ulong   data_offset;
 abi_ulong   saved_auxv;
 abi_ulong   auxv_len;
-abi_ulong   arg_start;
-abi_ulong   arg_end;
-abi_ulong   arg_strings;
-abi_ulong   env_strings;
+abi_ulong   argc;
+abi_ulong   argv;
+abi_ulong   envc;
+abi_ulong   envp;
 abi_ulong   file_string;
 uint32_telf_flags;
 int personality;
 abi_ulong   alignment;
 
+/* Generic semihosting knows about these pointers. */
+abi_ulong   arg_strings;   /* strings for argv */
+abi_ulong   env_strings;   /* strings for envp; ends arg_strings */
+
 /* The fields below are used in FDPIC mode.  */
 abi_ulong   loadmap_addr;
 uint16_tnsegs;
diff --git a/linux-user/elfload.c b/linux-user/elfload.c
index 61063fd974..8c0765dd4b 100644
--- a/linux-user/elfload.c
+++ b/linux-user/elfload.c
@@ -1516,8 +1516,8 @@ static inline void init_thread(struct target_pt_regs *regs,
 regs->iaoq[0] = infop->entry;
 regs->iaoq[1] = infop->entry + 4;
 regs->gr[23] = 0;
-regs->gr[24] = infop->arg_start;
-regs->gr[25] = (infop->arg_end - infop->arg_start) / sizeof(abi_ulong);
+regs->gr[24] = infop->argv;
+regs->gr[25] = infop->argc;
 /* The top-of-stack contains a linkage buffer.  */
 regs->gr[30] = infop->start_stack + 64;
 regs->gr[31] = infop->entry;
@@ -2120,8 +2120,10 @@ static abi_ulong create_elf_tables(abi_ulong p, int argc, int envc,
 u_envp = u_argv + (argc + 1) * n;
 u_auxv = u_envp + (envc + 1) * n;
 info->saved_auxv = u_auxv;
-info->arg_start = u_argv;
-info->arg_end = u_argv + argc * n;
+info->argc = argc;
+info->envc = envc;
+info->argv = u_argv;
+info->envp = u_envp;
 
 /* This is correct because Linux defines
  * elf_addr_t as Elf32_Off / Elf64_Off
diff --git a/linux-user/linuxload.c b/linux-user/linuxload.c
index 2ed5fc45ed..745cce70ab 100644
--- a/linux-user/linuxload.c
+++ b/linux-user/linuxload.c
@@ -92,6 +92,11 @@ abi_ulong loader_build_argptr(int envc, int argc, abi_ulong sp,
 envp = sp;
 sp -= (argc + 1) * n;
 argv = sp;
+ts->info->envp = envp;
+ts->info->envc = envc;
+ts->info->argv = argv;
+ts->info->argc = argc;
+
 if (push_ptr) {
 /* FIXME - handle put_user() failures */
 sp -= n;
@@ -99,19 +104,22 @@ abi_ulong loader_build_argptr(int envc, int argc, abi_ulong sp,
 sp -= n;
 put_user_ual(argv, sp);
 }
+
 sp -= n;
 /* FIXME - handle put_user() failures */
 put_user_ual(argc, sp);
-ts->info->arg_start = stringp;
+
+ts->info->arg_strings = stringp;
 while (argc-- > 0) {
 /* FIXME - handle put_user() failures */
 put_user_ual(stringp, argv);
 argv += n;
 stringp += target_strlen(stringp) + 1;
 }
-ts->info->arg_end = stringp;
 /* FIXME - handle put_user() failures */
 put_user_ual(0, argv);
+
+ts->info->env_strings = stringp;
 while (envc-- > 0) {
 /* FIXME - handle put_user() failures */
 put_user_ual(stringp, envp);
diff --git a/linux-user/main.c b/linux-user/main.c
index 7ca48664e4..651e32f5f2 100644
--- a/linux-user/main.c
+++ b/linux-user/main.c
@@ -878,9 +878,9 @@ int main(int argc, char **argv, char **envp)
 fprintf(f, "entry   0x" TARGET_ABI_FMT_lx "\n",
 info->entry);
 fprintf(f, "argv_start  0x" TARGET_ABI_FMT_lx "\n",
-info->arg_start);
+info->argv);
 fprintf(f, "env_start   0x" TARGET_ABI_FMT_lx "\n",
-info->arg_end + (abi_ulong)sizeof(abi_ulong));
+info->envp);
 fprintf(f, "auxv_start  0x" TARGET_ABI_FMT_lx "\n",
 info->saved_auxv);
 qemu_log_

[PATCH] 9pfs: local: Do not follow symlink in _nofollow

2022-04-26 Thread Akihiko Odaki
Signed-off-by: Akihiko Odaki 
---
 hw/9pfs/9p-local.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hw/9pfs/9p-local.c b/hw/9pfs/9p-local.c
index d42ce6d8b82..def8afdb4d6 100644
--- a/hw/9pfs/9p-local.c
+++ b/hw/9pfs/9p-local.c
@@ -365,7 +365,7 @@ static int fchmodat_nofollow(int dirfd, const char *name, mode_t mode)
 if (fd == -1) {
 /* In case the file is writable-only and isn't a directory. */
 if (errno == EACCES) {
-fd = openat_file(dirfd, name, O_WRONLY, 0);
+fd = openat_file(dirfd, name, O_WRONLY | O_NOFOLLOW, 0);
 }
 if (fd == -1 && errno == EISDIR) {
 errno = EACCES;
-- 
2.32.0 (Apple Git-132)




Re: [PATCH v2 2/5] migration: Add 64bit variable array data type

2022-04-26 Thread Atish Kumar Patra
On Tue, Apr 26, 2022 at 5:50 PM Richard Henderson
 wrote:
>
> On 4/26/22 16:08, Atish Patra wrote:
> > +.num_offset = vmstate_offset_value(_state, _field_num, uint32_t),\
> ...
> >   } else if (field->flags & VMS_VARRAY_UINT32) {
> >   n_elems = *(uint32_t *)(opaque + field->num_offset);
> > +} else if (field->flags & VMS_VARRAY_UINT64) {
> > +n_elems = *(uint64_t *)(opaque + field->num_offset);
>
> Offset type mismatch.  Since num_harts is uint32_t, I don't believe you need 
> this patch at
> all.
>

Ahh yes. You are right. I read the function description wrong and
assumed that VMSTATE_VARRAY_UINT32
creates an variable array of uint32_t elements only. Thanks for the
clarification.

>
> r~



Re: [PATCH v2 1/5] hw/intc: Move mtimer/mtimecmp to aclint

2022-04-26 Thread Atish Kumar Patra
On Tue, Apr 26, 2022 at 5:50 PM Richard Henderson
 wrote:
>
> On 4/26/22 16:08, Atish Patra wrote:
> > @@ -334,7 +334,6 @@ const VMStateDescription vmstate_riscv_cpu = {
> >   VMSTATE_UINTTL(env.mscratch, RISCVCPU),
> >   VMSTATE_UINT64(env.mfromhost, RISCVCPU),
> >   VMSTATE_UINT64(env.mtohost, RISCVCPU),
> > -VMSTATE_UINT64(env.timecmp, RISCVCPU),
> >
>
> Must bump version_id and minimum_version_id.
>

Yeah. Fixed that. Thanks.

> r~



Re: [PATCH v2 2/5] 9pfs: fix qemu_mknodat(S_IFSOCK) on macOS

2022-04-26 Thread Akihiko Odaki

On 2022/04/26 21:38, Greg Kurz wrote:

On Tue, 26 Apr 2022 12:57:37 +0900
Akihiko Odaki  wrote:


On 2022/04/25 3:45, Christian Schoenebeck wrote:

+}
+err = chmod(addr.sun_path, mode);


I'm not sure if it is fine to have a time window between bind() and
chmod(). Do you have some rationale?


Good question. QEMU's 9p server is multi-threaded; all 9p requests come in
serialized and the 9p server controller portion (9p.c) is only running on
QEMU main thread, but the actual filesystem driver calls are then
dispatched to QEMU worker threads and therefore running concurrently at
this point:

https://wiki.qemu.org/Documentation/9p#Threads_and_Coroutines

Similar situation on Linux 9p client side: it handles access to a mounted
9p filesystem concurrently, requests are then serialized by 9p driver on
Linux and sent over wire to 9p server (host).

So yes, there might be implications by that short time windows. But could
that be exploited on macOS hosts in practice?

The socket file would have mode srwxr-xr-x for a short moment.

For security_model=mapped* this should not be a problem.

For security_model=none|passthrough, in theory, maybe? But how likely is
that? If you are using a Linux client for instance, trying to brute-force
opening the socket file, the client would send several 9p commands
(Twalk, Tgetattr, Topen, probably more). The time window of the two
commands above should be much smaller than that and I would expect one of
the 9p commands to error out in between.

What would be a viable approach to avoid this issue on macOS?


It is unlikely that a naive brute-force approach will succeed in
exploiting this. The more concerning scenario is that the attacker uses the
knowledge of the underlying implementation of macOS to cause resource
contention to widen the window. Whether an exploitation is viable
depends on how much time you spend digging XNU.
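
One obvious mitigation would be to make sure the socket node is never
created with permissive bits in the first place, e.g. (untested sketch;
note umask() is process-wide, which matters with QEMU's worker
threads):

#include <sys/socket.h>
#include <sys/stat.h>
#include <sys/un.h>

static int bind_with_mode(int sockfd, struct sockaddr_un *addr, mode_t mode)
{
    mode_t old = umask(0777);   /* node is created with mode 000 */
    int ret = bind(sockfd, (struct sockaddr *)addr, sizeof(*addr));

    umask(old);
    if (ret == 0) {
        ret = chmod(addr->sun_path, mode);  /* never more open than this */
    }
    return ret;
}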

However, I'm also not sure if it really *has* a race condition. Looking
at v9fs_co_mknod(), it sequentially calls s->ops->mknod() and
s->ops->lstat(). It also results in an entity called "path name based
fid" in the code, which inherently cannot identify a file when it is
renamed or recreated.

If there is some rationale it is safe, it may also be applied to the
sequence of bind() and chmod(). Can anyone explain the sequence of
s->ops->mknod() and s->ops->lstat() or path name based fid in general?


You are talking about the 9p server's controller level: I don't see anything that
would prevent a concurrent open() during this bind() ... chmod() time window
unfortunately.

Argument 'fidp' passed to function v9fs_co_mknod() reflects the directory in
which the new device file shall be created. So 'fidp' is not the device file
here, nor is 'fidp' modified during this function.

Function v9fs_co_mknod() is entered by 9p server on QEMU main thread. At the
beginning of the function it first acquires a read lock on a (per 9p export)
global coroutine mutex:

  v9fs_path_read_lock(s);

and holds this lock until returning from function v9fs_co_mknod(). But that's
just a read lock. Function v9fs_co_open() also just gains a read lock. So they
can happen concurrently.

Then v9fs_co_run_in_worker({...}) is called to dispatch and execute all the
code block (think of it as an Obj-C "block") inside this (macro actually) on a
QEMU worker thread. So an arbitrary background thread would then call the fs
driver functions:

  s->ops->mknod()
  v9fs_name_to_path()
  s->ops->lstat()

and then at the end of the code block the background thread would dispatch
back to QEMU main thread. So when we are reaching:

  v9fs_path_unlock(s);

we are already back on QEMU main thread, hence unlocking on main thread now
and finally leaving function v9fs_co_mknod().

The important thing to understand is, while that

  v9fs_co_run_in_worker({...})

code block is executed on a QEMU worker thread, the QEMU main thread (9p
server controller portion, i.e. 9p.c) is *not* sleeping, QEMU main thread
rather continues to process other (if any) client requests in the meantime. In
other words v9fs_co_run_in_worker() neither behaves exactly like Apple's GCD
dispatch_async(), nor like dispatch_sync(), as GCD is not coroutine based.

So 9p server might pull a pending 'Topen' client request from the input FIFO
in the meantime and likewise dispatch that to a worker thread, etc. Hence a
concurrent open() might in theory be possible, but I find it quite unlikely to
succeed in practice, as the open() call on the guest is translated by the
Linux client into a bunch of synchronous 9p requests on the path passed to
open(), and a round trip for each 9p message is something like ~0.3 ms.
That's quite large compared to the time window I would expect between
bind() ... chmod().

Does this answer your questions?


The time window may be widened by a malicious actor who knows XNU well,
so a window length inferred from experience is not really enough to
claim it safe.

Re: [PATCH v2 41/43] target/loongarch: Add gdb support.

2022-04-26 Thread Richard Henderson

On 4/25/22 02:10, Xiaojuan Yang wrote:

Signed-off-by: Xiaojuan Yang
Signed-off-by: Song Gao
---
  MAINTAINERS |  1 +
  configs/targets/loongarch64-softmmu.mak |  1 +
  gdb-xml/loongarch-base64.xml| 44 ++
  gdb-xml/loongarch-fpu64.xml | 57 +
  target/loongarch/cpu.c  |  9 +++
  target/loongarch/gdbstub.c  | 81 +
  target/loongarch/internals.h|  4 ++
  target/loongarch/meson.build|  1 +
  8 files changed, 198 insertions(+)
  create mode 100644 gdb-xml/loongarch-base64.xml
  create mode 100644 gdb-xml/loongarch-fpu64.xml
  create mode 100644 target/loongarch/gdbstub.c


Reviewed-by: Richard Henderson 

r~



Re: [PATCH v2 39/43] hw/loongarch: Add LoongArch load elf function.

2022-04-26 Thread Richard Henderson

On 4/25/22 02:10, Xiaojuan Yang wrote:

--- a/target/loongarch/cpu.c
+++ b/target/loongarch/cpu.c
@@ -525,8 +525,12 @@ static const MemoryRegionOps loongarch_qemu_ops = {
  static void do_cpu_reset(void *opaque)
  {
  LoongArchCPU *cpu = opaque;
+CPULoongArchState *env = &cpu->env;
  
  cpu_reset(CPU(cpu));

+if (env->load_elf) {
+cpu_set_pc(CPU(cpu), env->elf_address);
+}


I guess this is one way to do it, but I did show you a way completely within 
the board code.


r~



Re: [PATCH v2 37/43] hw/loongarch: Add some devices support for 3A5000.

2022-04-26 Thread Richard Henderson

On 4/25/22 02:10, Xiaojuan Yang wrote:

1. Add uart, virtio-net, vga and usb for 3A5000.
2. Add irq set and map for the pci host. Non-PCI devices
use IRQs 0-16, PCI devices use 16-64.
3. Add some unimplemented devices to emulate unused guest
memory space.

Signed-off-by: Xiaojuan Yang
Signed-off-by: Song Gao
---
  hw/loongarch/Kconfig   |  7 
  hw/loongarch/loongson3.c   | 77 ++
  include/hw/pci-host/ls7a.h |  8 
  3 files changed, 92 insertions(+)


Acked-by: Richard Henderson 

r~



PING: [PATCH] KVM: HWPoison: Fix memory address&size during remap

2022-04-26 Thread zhenwei pi

Hi, Paolo & Peter

Could you please review this patch?

On 4/20/22 14:45, zhenwei pi wrote:

qemu exits during reset with log:
qemu-system-x86_64: Could not remap addr: 1000@22001000

Currently, after an MCE on a guest's RAM, qemu records the ram_addr only
and remaps this address with a fixed size (TARGET_PAGE_SIZE) during reset.
In the hugetlbfs scenario, mmap(addr...) needs a page_size-aligned
address and the correct size. An unaligned address makes mmap fail.

What's more, when an MCE hits a guest's RAM, qemu records this address
and tries to fix it during reset; this should be common logic. So
remove kvm_hwpoison_page_add from architecture-dependent code and record
this in the SIGBUS handler instead. Finally, poisoning/unpoisoning a page
becomes static in kvm-all.c.

Signed-off-by: zhenwei pi 
---
  accel/kvm/kvm-all.c  | 47 ++--
  include/sysemu/kvm_int.h | 12 --
  target/arm/kvm64.c   |  1 -
  target/i386/kvm/kvm.c|  1 -
  4 files changed, 36 insertions(+), 25 deletions(-)

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index 5f1377ca04..2a91c5a461 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -1167,11 +1167,14 @@ int kvm_vm_check_extension(KVMState *s, unsigned int 
extension)
  return ret;
  }
  
+#ifdef KVM_HAVE_MCE_INJECTION

  typedef struct HWPoisonPage {
  ram_addr_t ram_addr;
+size_t page_size; /* normal page or hugeTLB page? */
  QLIST_ENTRY(HWPoisonPage) list;
  } HWPoisonPage;
  
+/* hwpoison_page_list stores the poisoned pages, unpoison them during reset */

  static QLIST_HEAD(, HWPoisonPage) hwpoison_page_list =
  QLIST_HEAD_INITIALIZER(hwpoison_page_list);
  
@@ -1181,25 +1184,48 @@ static void kvm_unpoison_all(void *param)
  
  QLIST_FOREACH_SAFE(page, &hwpoison_page_list, list, next_page) {

  QLIST_REMOVE(page, list);
-qemu_ram_remap(page->ram_addr, TARGET_PAGE_SIZE);
+qemu_ram_remap(page->ram_addr, page->page_size);
  g_free(page);
  }
  }
  
-void kvm_hwpoison_page_add(ram_addr_t ram_addr)

+static void kvm_hwpoison_page_add(CPUState *cpu, int sigbus_code, void *addr)
  {
  HWPoisonPage *page;
+ram_addr_t ram_addr, align_ram_addr;
+ram_addr_t offset;
+hwaddr paddr;
+size_t page_size;
+
+assert(sigbus_code == BUS_MCEERR_AR || sigbus_code == BUS_MCEERR_AO);
+ram_addr = qemu_ram_addr_from_host(addr);
+if (ram_addr == RAM_ADDR_INVALID ||
+!kvm_physical_memory_addr_from_host(cpu->kvm_state, addr, &paddr)) {
+/* only deal with valid guest RAM here */
+return;
+}
  
+/* get page size of RAM block, test it's a normal page or huge page */

+page_size = qemu_ram_block_from_host(addr, false, &offset)->page_size;
+align_ram_addr = QEMU_ALIGN_DOWN(ram_addr, page_size);
  QLIST_FOREACH(page, &hwpoison_page_list, list) {
-if (page->ram_addr == ram_addr) {
+if (page->ram_addr == align_ram_addr) {
+assert(page->page_size == page_size);
  return;
  }
  }
-page = g_new(HWPoisonPage, 1);
-page->ram_addr = ram_addr;
+
+page = g_new0(HWPoisonPage, 1);
+page->ram_addr = align_ram_addr;
+page->page_size = page_size;
  QLIST_INSERT_HEAD(&hwpoison_page_list, page, list);
  }
  
+static __thread void *pending_sigbus_addr;

+static __thread int pending_sigbus_code;
+static __thread bool have_sigbus_pending;
+#endif
+
  static uint32_t adjust_ioeventfd_endianness(uint32_t val, uint32_t size)
  {
  #if defined(HOST_WORDS_BIGENDIAN) != defined(TARGET_WORDS_BIGENDIAN)
@@ -2601,7 +2627,9 @@ static int kvm_init(MachineState *ms)
  s->kernel_irqchip_split = mc->default_kernel_irqchip_split ? 
ON_OFF_AUTO_ON : ON_OFF_AUTO_OFF;
  }
  
+#if defined KVM_HAVE_MCE_INJECTION

  qemu_register_reset(kvm_unpoison_all, NULL);
+#endif
  
  if (s->kernel_irqchip_allowed) {

  kvm_irqchip_create(s);
@@ -2782,12 +2810,6 @@ void kvm_cpu_synchronize_pre_loadvm(CPUState *cpu)
  run_on_cpu(cpu, do_kvm_cpu_synchronize_pre_loadvm, RUN_ON_CPU_NULL);
  }
  
-#ifdef KVM_HAVE_MCE_INJECTION

-static __thread void *pending_sigbus_addr;
-static __thread int pending_sigbus_code;
-static __thread bool have_sigbus_pending;
-#endif
-
  static void kvm_cpu_kick(CPUState *cpu)
  {
  qatomic_set(&cpu->kvm_run->immediate_exit, 1);
@@ -2883,6 +2905,8 @@ int kvm_cpu_exec(CPUState *cpu)
  #ifdef KVM_HAVE_MCE_INJECTION
  if (unlikely(have_sigbus_pending)) {
  qemu_mutex_lock_iothread();
+kvm_hwpoison_page_add(cpu, pending_sigbus_code,
+  pending_sigbus_addr);
  kvm_arch_on_sigbus_vcpu(cpu, pending_sigbus_code,
  pending_sigbus_addr);
  have_sigbus_pending = false;
@@ -3436,6 +3460,7 @@ int kvm_on_sigbus(int code, void *addr)
   * we can only get action optional here.
   */
  assert(code != BUS_MCEERR_AR);
+kvm_hwpoison_

Re: [PATCH v2 34/43] hw/intc: Add LoongArch extioi interrupt controller(EIOINTC)

2022-04-26 Thread Richard Henderson

On 4/25/22 02:10, Xiaojuan Yang wrote:

This patch realizes the EIOINTC interrupt controller.

Signed-off-by: Xiaojuan Yang 
Signed-off-by: Song Gao 
---
  hw/intc/Kconfig|   3 +
  hw/intc/loongarch_extioi.c | 483 +
  hw/intc/meson.build|   1 +
  hw/intc/trace-events   |   9 +
  hw/loongarch/Kconfig   |   1 +
  include/hw/intc/loongarch_extioi.h |  60 
  6 files changed, 557 insertions(+)
  create mode 100644 hw/intc/loongarch_extioi.c
  create mode 100644 include/hw/intc/loongarch_extioi.h

diff --git a/hw/intc/Kconfig b/hw/intc/Kconfig
index 71c04c328e..28bd1f185d 100644
--- a/hw/intc/Kconfig
+++ b/hw/intc/Kconfig
@@ -96,3 +96,6 @@ config LOONGARCH_PCH_MSI
  select MSI_NONBROKEN
  bool
  select UNIMP
+
+config LOONGARCH_EXTIOI
+bool
diff --git a/hw/intc/loongarch_extioi.c b/hw/intc/loongarch_extioi.c
new file mode 100644
index 00..1d9317c5bd
--- /dev/null
+++ b/hw/intc/loongarch_extioi.c
@@ -0,0 +1,483 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/*
+ * Loongson 3A5000 ext interrupt controller emulation
+ *
+ * Copyright (C) 2021 Loongson Technology Corporation Limited
+ */
+
+#include "qemu/osdep.h"
+#include "qemu/module.h"
+#include "qemu/log.h"
+#include "hw/irq.h"
+#include "hw/sysbus.h"
+#include "hw/loongarch/virt.h"
+#include "hw/qdev-properties.h"
+#include "exec/address-spaces.h"
+#include "hw/intc/loongarch_extioi.h"
+#include "migration/vmstate.h"
+#include "trace.h"
+
+static void extioi_update_irq(LoongArchExtIOI *s, int irq_num, int level)
+{
+uint64_t ipnum, cpu;
+
+/*
+ * Routing in group of 32 interrupts.
+ * The default value of csr[0x420][49]
+ * is 0 and nobody will change it,
+ * so 'ipmap' use bitmap function.
+ */
+ipnum = ldub_p((void *)&s->ipmap + (irq_num / 32)) & 0xf;
+ipnum = find_first_bit(&ipnum, 4);
+ipnum = (ipnum == 4) ? 0 : ipnum;
+
+cpu = ldub_p((void *)s->coremap + irq_num) & 0xf;
+cpu = find_first_bit(&cpu, 4);
+cpu = (cpu == 4) ? 0 : cpu;
+
+if (level) {
+if (test_bit(irq_num, (unsigned long *)s->enable) == false) {
+return;
+}
+bitmap_set((unsigned long *)s->coreisr[cpu], irq_num, 1);
+qemu_set_irq(s->parent_irq[cpu][ipnum], level);
+} else {
+bitmap_clear((unsigned long *)s->coreisr[cpu], irq_num, 1);
+qemu_set_irq(s->parent_irq[cpu][ipnum], level);


I told you before: all of this casting is wrong.
You cannot use only part of the bitmap.h interface.
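
For illustration, consistent bitmap.h usage might look like this (the field
types and the EXTIOI_IRQS/EXTIOI_CPUS constants are assumptions here, not
taken from the patch):

    /* In the state struct: declare real bitmaps instead of casting. */
    DECLARE_BITMAP(enable, EXTIOI_IRQS);
    DECLARE_BITMAP(coreisr[EXTIOI_CPUS], EXTIOI_IRQS);

    /* In the update path: use the matching accessors throughout. */
    if (level) {
        if (!test_bit(irq_num, s->enable)) {
            return;
        }
        set_bit(irq_num, s->coreisr[cpu]);
    } else {
        clear_bit(irq_num, s->coreisr[cpu]);
    }
    qemu_set_irq(s->parent_irq[cpu][ipnum], level);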


+#define MAX_CORES  LOONGARCH_MAX_VCPUS


Why do you have so many MAX cpu defines, all different?


r~



Re: [PATCH qemu 1/9] target/riscv: rvv: Add mask agnostic for vv instructions

2022-04-26 Thread eop Chen


> 
> On 2022/4/27 2:20 AM, eop Chen wrote:
>> 
>>> Weiwei Li <liwei...@iscas.ac.cn> wrote on April 26, 2022 at 4:47 PM:
>>> On 2022/3/17 3:26 PM, ~eopxd wrote:
 From: Yueh-Ting (eop) Chen  
 
 
 This is the first commit regarding the mask agnostic behavior.
 Added option 'rvv_ma_all_1s' to enable the behavior; the option
 defaults to false.
 
 Signed-off-by: eop Chen  
 Reviewed-by: Frank Chang  
 
>>> Similar to our last discussion, the vext_set_elems_1s_fns array can be
>>> simplified to a single vext_set_elems_1s,
>>> 
>>> since the fourth argument can be used as the start offset.
>>> 
>>> Another question, may be not related to this patchset, in section 3.4.3 of 
>>> the spec: 
>>> 
>>> "Mask destination tail elements are always treated as tail-agnostic, 
>>> regardless of the setting of vta."
>>> 
>>> What does "Mask destination tail elements" mean?
>>> 
>>> Regards,
>>> 
>>> Weiwei Li
>>> 
>> 
>> 
>> I have just updated a new version of the tail agnostic patch set; it also 
>> includes a bug fix discovered today.
>> 
>> Regarding the question, mask / masked-off apply to body elements and respect 
>> the mask policy, and tail elements respect the tail policy.
>> 
>> Regards,
>> 
>> eop Chen
> 
> I found some other descriptions in the spec. The instructions that write a
> mask register (such as vmadc, vmseq, vmsne, vmfeq...) all have a similar
> description
> 
> (You can search "tail-agnostic" in the spec):
> 
> Section 11.4: "Because these instructions produce a mask value, they always 
> operate with a tail-agnostic policy"
> 
> Section 11.8/13.13: "Compares write mask registers, and so always operate 
> under a tail-agnostic policy"
> 
> Section 15.1: "Mask elements past vl, the tail elements, are always updated 
> with a tail-agnostic policy"
> 
> Section 15.4/15.5/15.6: "The tail elements in the destination mask register 
> are updated under a tail-agnostic policy"
> 
> So I think "Mask destination tail elements" may mean the tail elements of
> instructions that take a mask register as the destination register.  For
> these instructions,
> 
> maybe the setting of VTA can be ignored.
> 
> Regards,
> 
> Weiwei Li
> 

Yes, the setting of VTA should be ignored where the V spec says so.
I think the implementation of the tail agnostic patch set doesn't need to
change for this.

Regards,

eop Chen





Re: [PATCH v2 33/43] hw/intc: Add LoongArch ls7a msi interrupt controller support(PCH-MSI)

2022-04-26 Thread Richard Henderson

On 4/25/22 02:10, Xiaojuan Yang wrote:

+static void loongarch_msi_mem_write(void *opaque, hwaddr addr,
+uint64_t val, unsigned size)
+{
+LoongArchPCHMSI *s = LOONGARCH_PCH_MSI(opaque);
+int irq_num = val & 0xff;
+
+trace_loongarch_msi_set_irq(irq_num);
+qemu_set_irq(s->pch_msi_irq[irq_num - PCH_PIC_IRQ_NUM], 1);
+}


This needs to be bounded properly -- irq_num >= PCH_PIC_IRQ_NUM -- since 'val' is under 
the control of the guest.
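
i.e., something along these lines (a sketch; PCH_MSI_IRQ_END stands in for
whatever bound matches the pch_msi_irq array size and is an assumption here):

    static void loongarch_msi_mem_write(void *opaque, hwaddr addr,
                                        uint64_t val, unsigned size)
    {
        LoongArchPCHMSI *s = LOONGARCH_PCH_MSI(opaque);
        int irq_num = val & 0xff;

        /* 'val' is guest-controlled: drop vectors outside the MSI range. */
        if (irq_num < PCH_PIC_IRQ_NUM || irq_num >= PCH_MSI_IRQ_END) {
            return;
        }
        trace_loongarch_msi_set_irq(irq_num);
        qemu_set_irq(s->pch_msi_irq[irq_num - PCH_PIC_IRQ_NUM], 1);
    }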




+static void pch_msi_irq_handler(void *opaque, int irq, int level)
+{
+LoongArchPCHMSI *s = LOONGARCH_PCH_MSI(opaque);
+
+qemu_set_irq(s->pch_msi_irq[irq], level);
+}


You should be able to connect the gpio lines directly, rather than having a pass-through 
function like this.  I think this is sysbus_pass_irq.
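
i.e., roughly (an untested sketch; it assumes both ends are sysbus devices,
the device variable names are illustrative, and the forwarding direction
should be checked against the sysbus_pass_irq() documentation):

    /* In the board code, instead of a pass-through handler: */
    sysbus_pass_irq(SYS_BUS_DEVICE(pch_msi), SYS_BUS_DEVICE(extioi));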



r~



PING: [PATCH v2 0/2] qga: Support NVMe disk type and SMART

2022-04-26 Thread zhenwei pi

Hi, Markus & Michael

Could you please give any hint about this patch?

On 4/20/22 10:26, zhenwei pi wrote:

v1 -> v2:
  - Update version from 7.0 to 7.1.

v1:
  - Introduce NVMe type for command 'guest-get-disks'.
  - Introduce SMART, and implement NVMe SMART for command 'guest-get-disks'.

Zhenwei Pi (2):
   qga: Introduce NVMe disk bus type
   qga: Introduce disk smart

  qga/commands-posix.c | 78 +++-
  qga/qapi-schema.json | 56 +--
  2 files changed, 131 insertions(+), 3 deletions(-)



--
zhenwei pi



Re: [PATCH v2 31/43] hw/loongarch: Add LoongArch ipi interrupt support(IPI)

2022-04-26 Thread Richard Henderson

On 4/25/22 02:10, Xiaojuan Yang wrote:

This patch realizes the IPI interrupt controller.

Signed-off-by: Xiaojuan Yang
Signed-off-by: Song Gao
---
  MAINTAINERS |   2 +
  hw/intc/Kconfig |   3 +
  hw/intc/loongarch_ipi.c | 166 
  hw/intc/meson.build |   1 +
  hw/intc/trace-events|   4 +
  hw/loongarch/Kconfig|   1 +
  include/hw/intc/loongarch_ipi.h |  50 ++
  include/hw/loongarch/virt.h |   2 +
  8 files changed, 229 insertions(+)
  create mode 100644 hw/intc/loongarch_ipi.c
  create mode 100644 include/hw/intc/loongarch_ipi.h


Reviewed-by: Richard Henderson 

r~



Re: [PATCH v2 30/43] hw/loongarch: Add support loongson3 virt machine type.

2022-04-26 Thread Richard Henderson

On 4/25/22 02:10, Xiaojuan Yang wrote:

Emulate a 3A5000 board using the new LoongArch instructions.
The 3A5000 belongs to the Loongson3 series of processors.
The board consists of a 3A5000 cpu model and the virt
bridge. The real 3A5000 board is quite complicated and
contains many functions. For now, in tcg softmmu mode,
only some of the functions are emulated.

More detailed info can be found at
https://github.com/loongson/LoongArch-Documentation

Signed-off-by: Xiaojuan Yang
Signed-off-by: Song Gao
---


Reviewed-by: Richard Henderson 

r~



Re: [PATCH v2 26/43] target/loongarch: Add LoongArch IOCSR instruction

2022-04-26 Thread Richard Henderson

On 4/25/22 02:10, Xiaojuan Yang wrote:

This includes:
- IOCSR{RD/WR}.{B/H/W/D}

Signed-off-by: Xiaojuan Yang 
Signed-off-by: Song Gao 
---
  target/loongarch/cpu.c|  52 ++
  target/loongarch/cpu.h|  25 +++
  target/loongarch/disas.c  |   8 +
  target/loongarch/helper.h |   8 +
  .../insn_trans/trans_privileged.c.inc |  35 
  target/loongarch/insns.decode |   9 +
  target/loongarch/iocsr_helper.c   | 174 ++
  target/loongarch/meson.build  |   1 +
  8 files changed, 312 insertions(+)
  create mode 100644 target/loongarch/iocsr_helper.c

diff --git a/target/loongarch/cpu.c b/target/loongarch/cpu.c
index 6d6216a846..3167323821 100644
--- a/target/loongarch/cpu.c
+++ b/target/loongarch/cpu.c
@@ -17,6 +17,8 @@
  #include "internals.h"
  #include "fpu/softfloat-helpers.h"
  #include "cpu-csr.h"
+#include "sysemu/reset.h"
+#include "hw/loader.h"
  
  const char * const regnames[32] = {

  "r0", "r1", "r2", "r3", "r4", "r5", "r6", "r7",
@@ -483,14 +485,64 @@ static void loongarch_cpu_realizefn(DeviceState *dev, 
Error **errp)
  lacc->parent_realize(dev, errp);
  }
  
+static void loongarch_qemu_write(void *opaque, hwaddr addr,

+ uint64_t val, unsigned size)
+{
+}
+
+static uint64_t loongarch_qemu_read(void *opaque, hwaddr addr, unsigned size)
+{
+switch (addr) {
+case FEATURE_REG:
+return 1UL << IOCSRF_MSI | 1UL << IOCSRF_EXTIOI |
+   1UL << IOCSRF_CSRIPI;
+case VENDOR_REG:
+return 0x6e6f73676e6f6f4c; /* "Loongson" */
+case CPUNAME_REG:
+return 0x303030354133; /* "3A5000" */
+case MISC_FUNC_REG:
+return 1UL << IOCSRM_EXTIOI_EN;
+}
+return 0;
+}
+
+static const MemoryRegionOps loongarch_qemu_ops = {
+.read = loongarch_qemu_read,
+.write = loongarch_qemu_write,
+.endianness = DEVICE_LITTLE_ENDIAN,
+.valid = {
+.min_access_size = 4,
+.max_access_size = 8,
+},
+.impl = {
+.min_access_size = 8,
+.max_access_size = 8,
+},
+};
+
+static void do_cpu_reset(void *opaque)
+{
+LoongArchCPU *cpu = opaque;
+
+cpu_reset(CPU(cpu));
+}
+
  static void loongarch_cpu_init(Object *obj)
  {
  LoongArchCPU *cpu = LOONGARCH_CPU(obj);
+CPULoongArchState *env = &cpu->env;
  
  cpu_set_cpustate_pointers(cpu);

  qdev_init_gpio_in(DEVICE(cpu), loongarch_cpu_set_irq, N_IRQS);
  timer_init_ns(&cpu->timer, QEMU_CLOCK_VIRTUAL,
&loongarch_constant_timer_cb, cpu);
+memory_region_init_io(&env->system_iocsr, OBJECT(cpu), NULL,
+  env, "iocsr", UINT64_MAX);
+address_space_init(&env->address_space_iocsr, &env->system_iocsr, "IOCSR");
+qemu_register_reset(do_cpu_reset, cpu);
+memory_region_init_io(&env->iocsr_mem, OBJECT(cpu), &loongarch_qemu_ops,
+  NULL, "iocsr_misc", 0x428);
+memory_region_add_subregion(&env->system_iocsr, 0, &env->iocsr_mem);
  }
  
  static ObjectClass *loongarch_cpu_class_by_name(const char *cpu_model)

diff --git a/target/loongarch/cpu.h b/target/loongarch/cpu.h
index e907fe3c51..734f90168e 100644
--- a/target/loongarch/cpu.h
+++ b/target/loongarch/cpu.h
@@ -12,6 +12,27 @@
  #include "fpu/softfloat-types.h"
  #include "hw/registerfields.h"
  #include "qemu/timer.h"
+#include "exec/memory.h"
+#include "hw/sysbus.h"
+
+#define IOCSRF_TEMP 0
+#define IOCSRF_NODECNT  1
+#define IOCSRF_MSI  2
+#define IOCSRF_EXTIOI   3
+#define IOCSRF_CSRIPI   4
+#define IOCSRF_FREQCSR  5
+#define IOCSRF_FREQSCALE6
+#define IOCSRF_DVFSV1   7
+#define IOCSRF_GMOD 9
+#define IOCSRF_VM   11
+
+#define FEATURE_REG 0x8
+#define VENDOR_REG  0x10
+#define CPUNAME_REG 0x20
+#define MISC_FUNC_REG   0x420
+#define IOCSRM_EXTIOI_EN48
+
+#define IOCSR_MEM_SIZE  0x428
  
  #define TCG_GUEST_DEFAULT_MO (0)
  
@@ -284,6 +305,10 @@ typedef struct CPUArchState {

  uint64_t CSR_DSAVE;
  
  LoongArchTLB  tlb[LOONGARCH_TLB_MAX];

+
+AddressSpace address_space_iocsr;
+MemoryRegion system_iocsr;
+MemoryRegion iocsr_mem;
  } CPULoongArchState;
  
  /**

diff --git a/target/loongarch/disas.c b/target/loongarch/disas.c
index c76b82dfdf..fcc06abc6b 100644
--- a/target/loongarch/disas.c
+++ b/target/loongarch/disas.c
@@ -609,6 +609,14 @@ INSN(bgeu, rr_offs)
  INSN(csrrd,r_csr)
  INSN(csrwr,r_csr)
  INSN(csrxchg,  rr_csr)
+INSN(iocsrrd_b,rr)
+INSN(iocsrrd_h,rr)
+INSN(iocsrrd_w,rr)
+INSN(iocsrrd_d,rr)
+INSN(iocsrwr_b,rr)
+INSN(iocsrwr_h,rr)
+INSN(iocsrwr_w,rr)
+INSN(iocsrwr_d,rr)
  
  #define output_fcmp(C, PREFIX, SUFFIX) \

  {

Re: [PATCH v2 18/26] io: make qio_channel_command_new_pid() static

2022-04-26 Thread Richard Henderson

On 4/26/22 02:27, marcandre.lur...@redhat.com wrote:

From: Marc-André Lureau

The function isn't used outside of qio_channel_command_new_spawn(),
which is !win32-specific.

Signed-off-by: Marc-André Lureau
---
  include/io/channel-command.h | 25 -
  io/channel-command.c | 26 ++
  2 files changed, 22 insertions(+), 29 deletions(-)


Reviewed-by: Richard Henderson 

r~



Re: [PATCH v2 17/26] Replace fcntl(0_NONBLOCK) with g_unix_set_fd_nonblocking()

2022-04-26 Thread Richard Henderson

On 4/26/22 02:27, marcandre.lur...@redhat.com wrote:

From: Marc-André Lureau

Suggested-by: Daniel P. Berrangé
Signed-off-by: Marc-André Lureau
---
  net/tap-bsd.c  |  4 ++--
  net/tap-linux.c|  2 +-
  net/tap-solaris.c  |  2 +-
  tests/qtest/fuzz/virtio_net_fuzz.c |  2 +-
  tests/unit/test-iov.c  |  4 ++--
  util/oslib-posix.c | 16 ++--
  6 files changed, 9 insertions(+), 21 deletions(-)


Typo in subject: s/0/O/.

Reviewed-by: Richard Henderson 

r~



Re: [PATCH] docs: Correct the default thread-pool-size

2022-04-26 Thread liuyd.f...@fujitsu.com
[+cc qemu-trivial]

On 4/14/22 8:19 PM, Vivek Goyal wrote:
> On Wed, Apr 13, 2022 at 12:20:54PM +0800, Liu Yiding wrote:
>> Refer to 26ec190964 virtiofsd: Do not use a thread pool by default
>>
>> Signed-off-by: Liu Yiding 
> Looks good. Our default used to be --thread-pool-size=64. But we changed
> it to using no thread pool because on lower end of workloads it performed
> better. When multiple threads are doing parallel I/O then, thread pool
> helps. So people who want to do lots of parallel I/O should manually
> enable thread pool.
>
> Acked-by: Vivek Goyal 
>
> Vivek
>> ---
>>   docs/tools/virtiofsd.rst | 2 +-
>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/docs/tools/virtiofsd.rst b/docs/tools/virtiofsd.rst
>> index 0c0560203c..33fed08c6f 100644
>> --- a/docs/tools/virtiofsd.rst
>> +++ b/docs/tools/virtiofsd.rst
>> @@ -127,7 +127,7 @@ Options
>>   .. option:: --thread-pool-size=NUM
>>   
>> Restrict the number of worker threads per request queue to NUM.  The 
>> default
>> -  is 64.
>> +  is 0.
>>   
>>   .. option:: --cache=none|auto|always
>>   
>> -- 
>> 2.31.1
>>
>>
>>
>>
-- 
Best Regards.
Yiding Liu


Re: [PATCH v2 14/26] os-posix: replace pipe()+cloexec with g_unix_open_pipe(CLOEXEC)

2022-04-26 Thread Richard Henderson

On 4/26/22 02:27, marcandre.lur...@redhat.com wrote:

From: Marc-André Lureau

Suggested-by: Daniel P. Berrangé
Signed-off-by: Marc-André Lureau
---
  os-posix.c | 3 +--
  1 file changed, 1 insertion(+), 2 deletions(-)


Reviewed-by: Richard Henderson 



-if (pipe(fds) == -1) {
+if (!g_unix_open_pipe(fds, FD_CLOEXEC, NULL)) {
 exit(1);
 }


We could do better than exit without error message, though pipe failure is 
pretty rare.
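
e.g., something like (a sketch; g_unix_open_pipe() fills in a GError we
could report):

    GError *err = NULL;

    if (!g_unix_open_pipe(fds, FD_CLOEXEC, &err)) {
        error_report("failed to create pipe: %s", err->message);
        exit(1);
    }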


r~



Re: [PATCH v2 12/26] qga: replace pipe() with g_unix_open_pipe(CLOEXEC)

2022-04-26 Thread Richard Henderson

On 4/26/22 02:27, marcandre.lur...@redhat.com wrote:

From: Marc-André Lureau 

Suggested-by: Daniel P. Berrangé 
Signed-off-by: Marc-André Lureau 
---
  qga/commands-posix.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/qga/commands-posix.c b/qga/commands-posix.c
index 77f4672ca2c9..094487c2c395 100644
--- a/qga/commands-posix.c
+++ b/qga/commands-posix.c
@@ -2529,7 +2529,7 @@ void qmp_guest_set_user_password(const char *username,
  goto out;
  }
  
-if (pipe(datafd) < 0) {

+if (!g_unix_open_pipe(datafd, FD_CLOEXEC, NULL)) {
  error_setg(errp, "cannot create pipe FDs");
  goto out;
  }


This looks wrong, since the next thing that happens is fork+execl.


r~



Re: [PATCH v2 11/26] util: replace pipe()+cloexec with g_unix_open_pipe()

2022-04-26 Thread Richard Henderson

On 4/26/22 02:27, marcandre.lur...@redhat.com wrote:

From: Marc-André Lureau

Suggested-by: Daniel P. Berrangé
Signed-off-by: Marc-André Lureau
---
  util/compatfd.c | 5 +
  1 file changed, 1 insertion(+), 4 deletions(-)


Reviewed-by: Richard Henderson 

r~



Re: [PATCH v2 10/26] Replace qemu_pipe() with g_unix_open_pipe()

2022-04-26 Thread Richard Henderson

On 4/26/22 02:26, marcandre.lur...@redhat.com wrote:

From: Marc-André Lureau

GLib g_unix_open_pipe() is essentially like qemu_pipe(), available since
2.30.

Signed-off-by: Marc-André Lureau
---
  include/qemu/osdep.h|  4 
  qemu-nbd.c  |  5 +++--
  util/event_notifier-posix.c |  2 +-
  util/oslib-posix.c  | 22 --
  4 files changed, 4 insertions(+), 29 deletions(-)


Reviewed-by: Richard Henderson 

r~



Re: [PATCH qemu 1/9] target/riscv: rvv: Add mask agnostic for vv instructions

2022-04-26 Thread Weiwei Li


On 2022/4/27 2:20 AM, eop Chen wrote:


Weiwei Li <liwei...@iscas.ac.cn> wrote on
April 26, 2022 at 4:47 PM:

On 2022/3/17 3:26 PM, ~eopxd wrote:

From: Yueh-Ting (eop) Chen

This is the first commit regarding the mask agnostic behavior.
Added option 'rvv_ma_all_1s' to enable the behavior; the option
defaults to false.

Signed-off-by: eop Chen
Reviewed-by: Frank Chang


Similar to our last discussion, the vext_set_elems_1s_fns array can be
simplified to a single vext_set_elems_1s,


since the fourth argument can be used as the start offset.

Another question, may be not related to this patchset, in section 
3.4.3 of the spec:


/"Mask destination tail elements are always treated as tail-agnostic, 
regardless of the setting of vta."/


What does "Mask destination tail elements" mean?

Regards,

Weiwei Li



I have just updated a new version of the tail agnostic patch set; it
also includes a bug fix discovered today.


Regarding the question, mask / masked-off apply to body elements and
respect the mask policy, and tail elements respect the tail policy.


Regards,

eop Chen


I found some other descriptions in the spec. The instructions that write a
mask register (such as vmadc, vmseq, vmsne, vmfeq...) all have a similar
description


(you can search for "tail-agnostic" in the spec):

Section 11.4: "Because these instructions produce a mask value, they
always operate with a tail-agnostic policy"

Section 11.8/13.13: "Compares write mask registers, and so always
operate under a tail-agnostic policy"

Section 15.1: "Mask elements past vl, the tail elements, are always
updated with a tail-agnostic policy"

Section 15.4/15.5/15.6: "The tail elements in the destination mask
register are updated under a tail-agnostic policy"


So I think "Mask destination tail elements" may means the tail element 
for instructions that take mask register as destination register.  For 
these instructions,


maybe the setting of VTA can be ignored.

Regards,

Weiwei Li



Re: [PATCH v2 09/26] block: move fcntl_setfl()

2022-04-26 Thread Richard Henderson

On 4/26/22 02:26, marcandre.lur...@redhat.com wrote:

From: Marc-André Lureau

It is only used by block/file-posix.c, move it there.

Signed-off-by: Marc-André Lureau
---
  include/sysemu/os-posix.h |  2 --
  block/file-posix.c| 15 +++
  util/oslib-posix.c| 15 ---
  3 files changed, 15 insertions(+), 17 deletions(-)


Reviewed-by: Richard Henderson 

r~



Re: [PATCH v2 08/26] Use g_unix_set_fd_nonblocking()

2022-04-26 Thread Richard Henderson

On 4/26/22 02:26, marcandre.lur...@redhat.com wrote:

From: Marc-André Lureau

API available since glib 2.30. It also preserves errno.

Signed-off-by: Marc-André Lureau
---
  hw/misc/ivshmem.c   | 2 +-
  util/event_notifier-posix.c | 6 ++
  util/main-loop.c| 2 +-
  3 files changed, 4 insertions(+), 6 deletions(-)


Reviewed-by: Richard Henderson 

r~



Re: [PATCH v2 4/5] target/riscv: Add stimecmp support

2022-04-26 Thread Richard Henderson

On 4/26/22 16:08, Atish Patra wrote:

+static RISCVException write_stimecmp(CPURISCVState *env, int csrno,
+                                     target_ulong val)
+{
+    RISCVCPU *cpu = env_archcpu(env);
+
+    if (riscv_cpu_mxl(env) == MXL_RV32) {
+        uint64_t stimecmp_hi = env->stimecmp >> 32;
+        env->stimecmp = (stimecmp_hi << 32) | (val & 0xffffffff);
+    } else {
+        env->stimecmp = val;
+        riscv_timer_write_timecmp(cpu, env->stimer, env->stimecmp, 0,
+                                  MIP_STIP);
+    }
+
+    return RISCV_EXCP_NONE;
+}
+
+static RISCVException write_stimecmph(CPURISCVState *env, int csrno,
+                                      target_ulong val)
+{
+    RISCVCPU *cpu = env_archcpu(env);
+    uint64_t timer_val = 0;
+
+    timer_val = (uint64_t)val << 32 | (env->stimecmp & 0xffffffff);
+    env->stimecmp = timer_val;
+    riscv_timer_write_timecmp(cpu, env->stimer, env->stimecmp, 0, MIP_STIP);
+
+    return RISCV_EXCP_NONE;
+}


Use deposit64() instead of open-coding the inserts.
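
i.e., something along these lines (untested sketch):

    /* write_stimecmp, RV32 path: update only the low 32 bits */
    env->stimecmp = deposit64(env->stimecmp, 0, 32, (uint64_t)val);

    /* write_stimecmph: update only the high 32 bits */
    env->stimecmp = deposit64(env->stimecmp, 32, 32, (uint64_t)val);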


r~



Re: [PATCH v2 3/5] hw/intc: Support migration of aclint device

2022-04-26 Thread Richard Henderson

On 4/26/22 16:08, Atish Patra wrote:

mtimecmp is part of ACLINT device now. This needs to preserved across
migration.

Signed-off-by: Atish Patra 


This must be squashed with patch 1.


+.fields = (VMStateField[]) {
+VMSTATE_VARRAY_UINT64(timecmp, RISCVAclintMTimerState,


And this must be UINT32 to match num_harts.
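
i.e. (assuming num_harts keeps its uint32_t type):

    VMSTATE_VARRAY_UINT32(timecmp, RISCVAclintMTimerState,
                          num_harts, 0,
                          vmstate_info_uint64, uint64_t),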


r~



Re: [PATCH v2 1/5] hw/intc: Move mtimer/mtimecmp to aclint

2022-04-26 Thread Richard Henderson

On 4/26/22 16:08, Atish Patra wrote:

@@ -334,7 +334,6 @@ const VMStateDescription vmstate_riscv_cpu = {
  VMSTATE_UINTTL(env.mscratch, RISCVCPU),
  VMSTATE_UINT64(env.mfromhost, RISCVCPU),
  VMSTATE_UINT64(env.mtohost, RISCVCPU),
-VMSTATE_UINT64(env.timecmp, RISCVCPU),
  


Must bump version_id and minimum_version_id.

r~



Re: [PATCH v2 2/5] migration: Add 64bit variable array data type

2022-04-26 Thread Richard Henderson

On 4/26/22 16:08, Atish Patra wrote:

+.num_offset = vmstate_offset_value(_state, _field_num, uint32_t),\

...

  } else if (field->flags & VMS_VARRAY_UINT32) {
  n_elems = *(uint32_t *)(opaque + field->num_offset);
+} else if (field->flags & VMS_VARRAY_UINT64) {
+n_elems = *(uint64_t *)(opaque + field->num_offset);


Offset type mismatch.  Since num_harts is uint32_t, I don't believe you need this patch at 
all.



r~



Re: [PULL 00/13] NBD patches through 2022-04-26

2022-04-26 Thread Richard Henderson

On 4/26/22 13:15, Eric Blake wrote:

The following changes since commit 80a172de5592b5c33aa6bc30da6f16c4ad1ae390:

   Merge tag 'trivial-branch-for-7.1-pull-request' of 
https://gitlab.com/laurent_vivier/qemu into staging (2022-04-26 07:54:22 -0700)

are available in the Git repository at:

   https://repo.or.cz/qemu/ericb.git tags/pull-nbd-2022-04-26

for you to fetch changes up to 620c5cb5da57dc97f655e6218e7ca9bc896394a2:

   nbd: document what is protected by the CoMutexes (2022-04-26 13:16:42 -0500)


nbd patches for 2022-04-26

- Paolo Bonzini: thread-safety improvements to NBD client
- Vladimir Sementsov-Ogievsky: finer-grained selection of bitmaps during
   nbd-export



Applied, thanks.  Please update https://wiki.qemu.org/ChangeLog/7.1 as 
appropriate.


r~





Paolo Bonzini (9):
   nbd: safeguard against waking up invalid coroutine
   nbd: mark more coroutine_fns
   nbd: remove peppering of nbd_client_connected
   nbd: keep send_mutex/free_sema handling outside 
nbd_co_do_establish_connection
   nbd: use a QemuMutex to synchronize yanking, reconnection and coroutines
   nbd: code motion and function renaming
   nbd: move s->state under requests_lock
   nbd: take receive_mutex when reading requests[].receiving
   nbd: document what is protected by the CoMutexes

Vladimir Sementsov-Ogievskiy (3):
   qapi: rename BlockDirtyBitmapMergeSource to BlockDirtyBitmapOrStr
   qapi: nbd-export: allow select bitmaps by node/name pair
   iotests/223: check new possibility of exporting bitmaps by node/name

  qapi/block-core.json   |   6 +-
  qapi/block-export.json |   5 +-
  block/coroutines.h |   5 +-
  include/block/block_int-global-state.h |   2 +-
  block/monitor/bitmap-qmp-cmds.c|   6 +-
  block/nbd.c| 286 +
  blockdev-nbd.c |   8 +-
  nbd/server.c   |  63 +---
  qemu-img.c |   8 +-
  qemu-nbd.c |  11 +-
  tests/qemu-iotests/223 |  16 ++
  tests/qemu-iotests/223.out |  47 +-
  12 files changed, 282 insertions(+), 181 deletions(-)






Re: [RFC PATCH 6/7] target/ppc: Implemented pmxvf*ger*

2022-04-26 Thread Richard Henderson

On 4/26/22 05:50, Lucas Mateus Castro(alqotel) wrote:

+&MMIRR_XX3_NO_P xa xb xt xmsk ymsk


Don't create this...


+@MMIRR_XX3_NO_P .. ..  .. . .  xmsk:4  \
+.. ... .. . .  ... \
+&MMIRR_XX3_NO_P xb=%xx3_xb xt=%xx3_at


just set pmsk=1 here instead...


+static bool do_ger_MMIRR_XX3_NO_PMSK(DisasContext *ctx, arg_MMIRR_XX3_NO_P *a,
+ int op_flag, void (*helper)(TCGv_env,
+ TCGv_i32, TCGv_i32, TCGv_i32,
+ TCGv_i32, TCGv_i32))
+{
+arg_MMIRR_XX3 x;
+x.xa = a->xa;
+x.xb = a->xb;
+x.xt = a->xt;
+x.pmsk = 0x1;
+x.ymsk = a->ymsk;
+x.xmsk = a->xmsk;
+return do_ger_MMIRR_XX3(ctx, &x, op_flag, helper);
+}


so you can drop this.
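
In other words, keep the existing &MMIRR_XX3 arg set and let the format
supply the constant, along the lines of (the bit patterns are elided here
just as in the quote above):

@MMIRR_XX3_NO_P .. ..  .. . .  xmsk:4  \
                .. ... .. . .  ... \
                &MMIRR_XX3 xb=%xx3_xb xt=%xx3_at pmsk=1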


r~



Re: [RFC PATCH 5/7] target/ppc: Implemented xvf16ger*

2022-04-26 Thread Richard Henderson

On 4/26/22 05:50, Lucas Mateus Castro(alqotel) wrote:

+#define VSXGER16(NAME, ORIG_T, OR_EL)   \
+void NAME(CPUPPCState *env, uint32_t a_r, uint32_t b_r, \
+  uint32_t  at_r, uint32_t mask, uint32_t packed_flags) \
+{   \
+ppc_vsr_t *at;  \
+float32 psum, aux_acc, va, vb, vc, vd;  \
+int i, j, xmsk_bit, ymsk_bit;   \
+uint8_t xmsk = mask & 0x0F; \
+uint8_t ymsk = (mask >> 4) & 0x0F;  \
+uint8_t pmsk = (mask >> 8) & 0x3;   \
+ppc_vsr_t *b = cpu_vsr_ptr(env, b_r);   \
+ppc_vsr_t *a = cpu_vsr_ptr(env, a_r);   \
+float_status *excp_ptr = &env->fp_status;   \
+bool acc = ger_acc_flag(packed_flags);  \
+bool neg_acc = ger_neg_acc_flag(packed_flags);  \
+bool neg_mul = ger_neg_mul_flag(packed_flags);  \
+for (i = 0, xmsk_bit = 1 << 3; i < 4; i++, xmsk_bit >>= 1) {\
+at = cpu_vsr_ptr(env, at_r + i);\
+for (j = 0, ymsk_bit = 1 << 3; j < 4; j++, ymsk_bit >>= 1) {\
+if ((xmsk_bit & xmsk) && (ymsk_bit & ymsk)) {   \
+va = !(pmsk & 2) ? float32_zero :   \
+   GET_VSR(Vsr##OR_EL, a,   \
+   2 * i, ORIG_T, float32); \
+vb = !(pmsk & 2) ? float32_zero :   \
+   GET_VSR(Vsr##OR_EL, b,   \
+   2 * j, ORIG_T, float32); \
+vc = !(pmsk & 1) ? float32_zero :   \
+   GET_VSR(Vsr##OR_EL, a,   \
+2 * i + 1, ORIG_T, float32);\
+vd = !(pmsk & 1) ? float32_zero :   \
+   GET_VSR(Vsr##OR_EL, b,   \
+2 * j + 1, ORIG_T, float32);\
+psum = float32_mul(va, vb, excp_ptr);   \
+psum = float32_muladd(vc, vd, psum, 0, excp_ptr);   \


This isn't correct -- the intermediate 'prod' (the first multiply) is not rounded.  I 
think the correct way to implement this (barring new softfloat functions) is to compute 
the intermediate product as float64 with float_round_to_odd, then float64r32_muladd into 
the correct rounding mode to finish.
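
Roughly this shape, in terms of the existing softfloat API (an illustrative
sketch, not a drop-in replacement; variable names follow the patch):

    float64 va64, vb64, vc64, vd64, prod, psum64;
    FloatRoundMode orig_rm = get_float_rounding_mode(excp_ptr);

    va64 = float32_to_float64(va, excp_ptr);
    vb64 = float32_to_float64(vb, excp_ptr);
    vc64 = float32_to_float64(vc, excp_ptr);
    vd64 = float32_to_float64(vd, excp_ptr);

    /* First product in float64 with round-to-odd: no double rounding. */
    set_float_rounding_mode(float_round_to_odd, excp_ptr);
    prod = float64_mul(va64, vb64, excp_ptr);
    set_float_rounding_mode(orig_rm, excp_ptr);

    /* Second product fused with the first, rounded once to float32
     * precision; the result stays in float64 representation. */
    psum64 = float64r32_muladd(vc64, vd64, prod, 0, excp_ptr);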



+if (acc) {  \
+if (neg_mul) {  \
+psum = float32_neg(psum);   \
+}   \
+if (neg_acc) {  \
+aux_acc = float32_neg(at->VsrSF(j));\
+} else {\
+aux_acc = at->VsrSF(j); \
+}   \
+at->VsrSF(j) = float32_add(psum, aux_acc,   \
+   excp_ptr);   \


This one, thankfully, uses the rounded intermediate result 'msum', so is ok.

Please do convert this from a macro.  Given that float16 and bfloat16 are addressed the 
same, I think the only callback you need is the conversion from float16_to_float64.  Drop 
the bf16 accessor to ppc_vsr_t.
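
For the conversion callback, something along these lines might work (a
sketch; the names are made up here):

    typedef float64 ger_cvt16(uint16_t in, float_status *s);

    static float64 f16_to_f64(uint16_t in, float_status *s)
    {
        return float16_to_float64(in, true, s);
    }

    static float64 bf16_to_f64(uint16_t in, float_status *s)
    {
        return bfloat16_to_float64(in, s);
    }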



r~



Re: [RFC PATCH 4/7] target/ppc: Implemented xvf*ger*

2022-04-26 Thread Richard Henderson

On 4/26/22 05:50, Lucas Mateus Castro(alqotel) wrote:

+#define VSXGER(NAME, TYPE, EL)  \
+void NAME(CPUPPCState *env, uint32_t a_r, uint32_t b_r, \
+  uint32_t  at_r, uint32_t mask, uint32_t packed_flags) \
+{   \
+ppc_vsr_t *a, *b, *at;  \
+TYPE aux_acc, va, vb;   \
+int i, j, xmsk_bit, ymsk_bit, op_flags; \
+uint8_t xmsk = mask & 0x0F; \
+uint8_t ymsk = (mask >> 4) & 0x0F;  \
+int ymax = MIN(4, 128 / (sizeof(TYPE) * 8));\
+b = cpu_vsr_ptr(env, b_r);  \
+float_status *excp_ptr = &env->fp_status;   \
+bool acc = ger_acc_flag(packed_flags);  \
+bool neg_acc = ger_neg_acc_flag(packed_flags);  \
+bool neg_mul = ger_neg_mul_flag(packed_flags);  \
+helper_reset_fpstatus(env); \
+for (i = 0, xmsk_bit = 1 << 3; i < 4; i++, xmsk_bit >>= 1) {\
+a = cpu_vsr_ptr(env, a_r + i / ymax);   \
+at = cpu_vsr_ptr(env, at_r + i);\
+for (j = 0, ymsk_bit = 1 << (ymax - 1); j < ymax;   \
+ j++, ymsk_bit >>= 1) { \
+if ((xmsk_bit & xmsk) && (ymsk_bit & ymsk)) {   \
+op_flags = (neg_acc ^ neg_mul) ?\
+  float_muladd_negate_c : 0;\
+op_flags |= (neg_mul) ? \
+ float_muladd_negate_result : 0;\


There's no need to compute op_flags in the inner loop.
Indeed, probably better to compute it in translation.

This macro is trickier than the integer to turn into a function, however,


+va = a->Vsr##EL(i % ymax);  \
+vb = b->Vsr##EL(j); \
+aux_acc = at->Vsr##EL(j);   \
+if (acc) {  \
+at->Vsr##EL(j) = TYPE##_muladd(va, vb, aux_acc, \
+   op_flags,\
+   excp_ptr);   \
+} else {\
+at->Vsr##EL(j) = TYPE##_mul(va, vb, excp_ptr);  \
+}   \
+} else {\
+at->Vsr##EL(j) = 0; \
+}   \


static void vsxger_zero_f(ppc_vsr_t *a, int j)
{
    a->VsrSF(j) = float32_zero;
}

static void vsxger_mul_f(ppc_vsr_t *d, ppc_vsr_t *a, ppc_vsr_t *b,
                         int i, int j, int flags, float_status *s)
{
    float32 af = a->VsrSF(i);
    float32 bf = b->VsrSF(j);
    d->VsrSF(j) = float32_mul(af, bf, s);
}

static void vsxger_mac_f(ppc_vsr_t *d, ppc_vsr_t *a, ppc_vsr_t *b,
                         int i, int j, int flags, float_status *s)
{
    float32 af = a->VsrSF(i);
    float32 bf = b->VsrSF(j);
    float32 cf = d->VsrSF(j);
    d->VsrSF(j) = float32_muladd(af, bf, cf, flags, s);
}

is probably a good place to start for callbacks.


r~



Re: [RFC PATCH 2/7] target/ppc: Implemented xvi*ger* instructions

2022-04-26 Thread Richard Henderson

On 4/26/22 05:50, Lucas Mateus Castro(alqotel) wrote:

+%xx_at  23:3 !function=times_4
+@XX3_at .. ... .. . .  ...  &XX3 xt=%xx_at 
xb=%xx_xb


Hmm.  Depends, I suppose on whether you want acc[0-7] or vsr[0-28]


+/*
+ * Packed VSX Integer GER Flags
+ * 00 - no accumulation no saturation
+ * 01 - accumulate but no saturation
+ * 10 - no accumulation but with saturation
+ * 11 - accumulate with saturation
+ */
+static inline bool get_sat(uint32_t flags)
+{
+return flags & 0x2;
+}
+
+static inline bool get_acc(uint32_t flags)
+{
+return flags & 0x1;
+}


Better to have separate helpers for these?  They'd be immediate operands to the function 
replacing XVIGER (see below) and thus optimize well.



+#define GET_VsrN(a, i) (extract32(a->VsrB((i) / 2), (i) % 2 ? 4 : 0, 4))
+#define GET_VsrB(a, i) a->VsrB(i)
+#define GET_VsrH(a, i) a->VsrH(i)
+
+#define GET_VsrSN(a, i) (sextract32(a->VsrSB((i) / 2), (i) % 2 ? 4 : 0, 4))
+#define GET_VsrSB(a, i) a->VsrSB(i)
+#define GET_VsrSH(a, i) a->VsrSH(i)


These can be made into functions of the form

typedef int32_t xviger_extract(ppc_vsr_t *a, int i);



+#define XVIGER(NAME, RANK, EL)                                             \
+void NAME(CPUPPCState *env, uint32_t a_r, uint32_t b_r,                    \
+          uint32_t at_r, uint32_t mask, uint32_t packed_flags)             \
+{                                                                          \
+    ppc_vsr_t *a = cpu_vsr_ptr(env, a_r), *b = cpu_vsr_ptr(env, b_r), *at; \
+    bool sat = get_sat(packed_flags), acc = get_acc(packed_flags);         \
+    uint8_t pmsk = ger_get_pmsk(mask), xmsk = ger_get_xmsk(mask),          \
+            ymsk = ger_get_ymsk(mask);                                     \
+    uint8_t pmsk_bit, xmsk_bit, ymsk_bit;                                  \
+    int64_t psum;                                                          \
+    int32_t va, vb;                                                        \
+    int i, j, k;                                                           \
+    for (i = 0, xmsk_bit = 1 << 3; i < 4; i++, xmsk_bit >>= 1) {           \
+        at = cpu_vsr_ptr(env, at_r + i);                                   \
+        for (j = 0, ymsk_bit = 1 << 3; j < 4; j++, ymsk_bit >>= 1) {       \
+            if ((xmsk_bit & xmsk) && (ymsk_bit & ymsk)) {                  \
+                psum = 0;                                                  \
+                for (k = 0, pmsk_bit = 1 << (RANK - 1); k < RANK;          \
+                     k++, pmsk_bit >>= 1) {                                \
+                    if (pmsk_bit & pmsk) {                                 \
+                        va = (int32_t)GET_VsrS##EL(a, RANK * i + k);       \
+                        vb = (int32_t)((RANK == 4) ?                       \
+                                       GET_Vsr##EL(b, RANK * j + k) :      \
+                                       GET_VsrS##EL(b, RANK * j + k));     \
+                        psum += va * vb;                                   \
+                    }                                                      \
+                }                                                          \
+                if (acc) {                                                 \
+                    psum += at->VsrSW(j);                                  \
+                }                                                          \
+                if (sat && psum > INT32_MAX) {                             \
+                    set_vscr_sat(env);                                     \
+                    at->VsrSW(j) = INT32_MAX;                              \
+                } else if (sat && psum < INT32_MIN) {                      \
+                    set_vscr_sat(env);                                     \
+                    at->VsrSW(j) = INT32_MIN;                              \
+                } else {                                                   \
+                    at->VsrSW(j) = (int32_t)psum;                          \
+                }                                                          \
+            } else {                                                       \
+                at->VsrSW(j) = 0;                                          \
+            }                                                              \
+        }                                                                  \
+    }                                                                      \
+}


... which means that this monster can be a function instead of a non-debuggable 
macro.


diff --git a/target/ppc/internal.h b/target/ppc/internal.h
index 8094e0b033..a994d982

Re: [PATCH v10 7/7] multifd: Implement zero copy write in multifd migration (multifd-zero-copy)

2022-04-26 Thread Peter Xu
On Tue, Apr 26, 2022 at 08:06:56PM -0300, Leonardo Bras wrote:
> Implement zero copy send on nocomp_send_write(), by making use of QIOChannel
> writev + flags & flush interface.
> 
> Change multifd_send_sync_main() so flush_zero_copy() can be called
> after each iteration in order to make sure all dirty pages are sent before
> a new iteration is started. It will also flush at the beginning and at the
> end of migration.
> 
> Also make it return -1 if flush_zero_copy() fails, in order to cancel
> the migration process, and avoid resuming the guest in the target host
> without receiving all current RAM.
> 
> This will work fine on RAM migration because the RAM pages are not usually 
> freed,
> and there is no problem with changing the page content between 
> writev_zero_copy() and
> the actual sending of the buffer, because this change will dirty the page and
> cause it to be re-sent on a next iteration anyway.
> 
> A lot of locked memory may be needed in order to use multifd migration
> with zero-copy enabled, so disabling the feature should be necessary for
> low-privileged users trying to perform multifd migrations.
> 
> Signed-off-by: Leonardo Bras 

Reviewed-by: Peter Xu 

-- 
Peter Xu




Re: [PATCH v10 6/7] multifd: Send header packet without flags if zero-copy-send is enabled

2022-04-26 Thread Peter Xu
On Tue, Apr 26, 2022 at 08:06:55PM -0300, Leonardo Bras wrote:
> Since d48c3a0445 ("multifd: Use a single writev on the send side"),
> sending the header packet and the memory pages happens in the same
> writev, which can potentially make the migration faster.
> 
> Using channel-socket as example, this works well with the default copying
> mechanism of sendmsg(), but with zero-copy-send=true, it will cause
> the migration to often break.
> 
> This happens because the header packet buffer gets reused quite often,
> and there is a high chance that by the time the MSG_ZEROCOPY mechanism get
> to send the buffer, it has already changed, sending the wrong data and
> causing the migration to abort.
> 
> It means that, as it is, the buffer for the header packet is not suitable
> for sending with MSG_ZEROCOPY.
> 
> In order to enable zero copy for multifd, send the header packet on an
> individual write(), without any flags, and the remaining pages with a
> writev(), as it was happening before. This only changes how a migration
> with zero-copy-send=true works, not changing any current behavior for
> migrations with zero-copy-send=false.
> 
> Signed-off-by: Leonardo Bras 
> ---
>  migration/multifd.c | 23 ---
>  1 file changed, 20 insertions(+), 3 deletions(-)
> 
> diff --git a/migration/multifd.c b/migration/multifd.c
> index 15fb668e64..07b2e92d8d 100644
> --- a/migration/multifd.c
> +++ b/migration/multifd.c
> @@ -617,6 +617,7 @@ static void *multifd_send_thread(void *opaque)
>  MultiFDSendParams *p = opaque;
>  Error *local_err = NULL;
>  int ret = 0;
> +bool use_zero_copy_send = migrate_use_zero_copy_send();
>  
>  trace_multifd_send_thread_start(p->id);
>  rcu_register_thread();
> @@ -639,9 +640,14 @@ static void *multifd_send_thread(void *opaque)
>  if (p->pending_job) {
>  uint64_t packet_num = p->packet_num;
>  uint32_t flags = p->flags;
> -p->iovs_num = 1;
>  p->normal_num = 0;
>  
> +if (use_zero_copy_send) {
> +p->iovs_num = 0;
> +} else {
> +p->iovs_num = 1;
> +}
> +
>  for (int i = 0; i < p->pages->num; i++) {
>  p->normal[p->normal_num] = p->pages->offset[i];
>  p->normal_num++;
> @@ -665,8 +671,19 @@ static void *multifd_send_thread(void *opaque)
>  trace_multifd_send(p->id, packet_num, p->normal_num, flags,
> p->next_packet_size);
>  
> -p->iov[0].iov_len = p->packet_len;
> -p->iov[0].iov_base = p->packet;
> +if (use_zero_copy_send) {
> +/* Send header first, without zerocopy */
> +ret = qio_channel_write_all(p->c, (void *)p->packet,
> +p->packet_len, &local_err);
> +if (ret != 0) {
> +break;
> +}
> +

Extra but useless newline.. but not worth a repost.  Looks good here:

Reviewed-by: Peter Xu 

Thanks,

> +} else {
> +/* Send header using the same writev call */
> +p->iov[0].iov_len = p->packet_len;
> +p->iov[0].iov_base = p->packet;
> +}
>  
>  ret = qio_channel_writev_all(p->c, p->iov, p->iovs_num,
>   &local_err);
> -- 
> 2.36.0
> 

-- 
Peter Xu




Re: [RFC PATCH 3/4] net: slirp: add support for CFI-friendly timer API

2022-04-26 Thread Samuel Thibault
Paolo Bonzini, on Tue, 12 April 2022 at 14:13:36 +0200, wrote:
> libslirp 4.7 introduces a CFI-friendly version of the .timer_new callback.
> The new callback replaces the function pointer with an enum; invoking the
> callback is done with a new function slirp_handle_timer.
> 
> Support the new API so that CFI can be made compatible with using a system
> libslirp.
> 
> Signed-off-by: Paolo Bonzini 

Reviewed-by: Samuel Thibault 

> ---
>  net/slirp.c | 41 -
>  1 file changed, 40 insertions(+), 1 deletion(-)
> 
> diff --git a/net/slirp.c b/net/slirp.c
> index b3a92d6e38..57af42299d 100644
> --- a/net/slirp.c
> +++ b/net/slirp.c
> @@ -184,10 +184,43 @@ static int64_t net_slirp_clock_get_ns(void *opaque)
>  return qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL);
>  }
>  
> +typedef struct SlirpTimer SlirpTimer;
>  struct SlirpTimer {
>  QEMUTimer timer;
> +#if SLIRP_CHECK_VERSION(4,7,0)
> +Slirp *slirp;
> +SlirpTimerId id;
> +void *cb_opaque;
> +#endif
> +};
> +
> +#if SLIRP_CHECK_VERSION(4,7,0)
> +static void net_slirp_init_completed(Slirp *slirp, void *opaque)
> +{
> +SlirpState *s = opaque;
> +s->slirp = slirp;
>  }
>  
> +static void net_slirp_timer_cb(void *opaque)
> +{
> +SlirpTimer *t = opaque;
> +slirp_handle_timer(t->slirp, t->id, t->cb_opaque);
> +}
> +
> +static void *net_slirp_timer_new_opaque(SlirpTimerId id,
> +void *cb_opaque, void *opaque)
> +{
> +SlirpState *s = opaque;
> +SlirpTimer *t = g_new(SlirpTimer, 1);
> +t->slirp = s->slirp;
> +t->id = id;
> +t->cb_opaque = cb_opaque;
> +timer_init_full(&t->timer, NULL, QEMU_CLOCK_VIRTUAL,
> +SCALE_MS, QEMU_TIMER_ATTR_EXTERNAL,
> +net_slirp_timer_cb, t);
> +return t;
> +}
> +#else
>  static void *net_slirp_timer_new(SlirpTimerCb cb,
>   void *cb_opaque, void *opaque)
>  {
> @@ -197,6 +230,7 @@ static void *net_slirp_timer_new(SlirpTimerCb cb,
>  cb, cb_opaque);
>  return t;
>  }
> +#endif
>  
>  static void net_slirp_timer_free(void *timer, void *opaque)
>  {
> @@ -231,7 +265,12 @@ static const SlirpCb slirp_cb = {
>  .send_packet = net_slirp_send_packet,
>  .guest_error = net_slirp_guest_error,
>  .clock_get_ns = net_slirp_clock_get_ns,
> +#if SLIRP_CHECK_VERSION(4,7,0)
> +.init_completed = net_slirp_init_completed,
> +.timer_new_opaque = net_slirp_timer_new_opaque,
> +#else
>  .timer_new = net_slirp_timer_new,
> +#endif
>  .timer_free = net_slirp_timer_free,
>  .timer_mod = net_slirp_timer_mod,
>  .register_poll_fd = net_slirp_register_poll_fd,
> @@ -578,7 +617,7 @@ static int net_slirp_init(NetClientState *peer, const 
> char *model,
>  
>  s = DO_UPCAST(SlirpState, nc, nc);
>  
> -cfg.version = 3;
> +cfg.version = SLIRP_CHECK_VERSION(4,7,0) ? 4 : 3;
>  cfg.restricted = restricted;
>  cfg.in_enabled = ipv4;
>  cfg.vnetwork = net;
> -- 
> 2.35.1
> 
> 

-- 
Samuel
---
For an independent, transparent and rigorous evaluation!
I support Inria's Evaluation Committee.



Re: [RFC PATCH 2/4] net: slirp: switch to slirp_new

2022-04-26 Thread Samuel Thibault
Paolo Bonzini, on Tue, 12 April 2022 at 14:13:35 +0200, wrote:
> Replace slirp_init with slirp_new, so that a more recent cfg.version
> can be specified.
> 
> Signed-off-by: Paolo Bonzini 

Reviewed-by: Samuel Thibault 

> ---
>  net/slirp.c | 27 +--
>  1 file changed, 21 insertions(+), 6 deletions(-)
> 
> diff --git a/net/slirp.c b/net/slirp.c
> index f1e25d741f..b3a92d6e38 100644
> --- a/net/slirp.c
> +++ b/net/slirp.c
> @@ -389,6 +389,7 @@ static int net_slirp_init(NetClientState *peer, const 
> char *model,
>  #if defined(CONFIG_SMBD_COMMAND)
>  struct in_addr smbsrv = { .s_addr = 0 };
>  #endif
> +SlirpConfig cfg = { 0 };
>  NetClientState *nc;
>  SlirpState *s;
>  char buf[20];
> @@ -577,12 +578,26 @@ static int net_slirp_init(NetClientState *peer, const 
> char *model,
>  
>  s = DO_UPCAST(SlirpState, nc, nc);
>  
> -s->slirp = slirp_init(restricted, ipv4, net, mask, host,
> -  ipv6, ip6_prefix, vprefix6_len, ip6_host,
> -  vhostname, tftp_server_name,
> -  tftp_export, bootfile, dhcp,
> -  dns, ip6_dns, dnssearch, vdomainname,
> -  &slirp_cb, s);
> +cfg.version = 3;
> +cfg.restricted = restricted;
> +cfg.in_enabled = ipv4;
> +cfg.vnetwork = net;
> +cfg.vnetmask = mask;
> +cfg.vhost = host;
> +cfg.in6_enabled = ipv6;
> +cfg.vprefix_addr6 = ip6_prefix;
> +cfg.vprefix_len = vprefix6_len;
> +cfg.vhost6 = ip6_host;
> +cfg.vhostname = vhostname;
> +cfg.tftp_server_name = tftp_server_name;
> +cfg.tftp_path = tftp_export;
> +cfg.bootfile = bootfile;
> +cfg.vdhcp_start = dhcp;
> +cfg.vnameserver = dns;
> +cfg.vnameserver6 = ip6_dns;
> +cfg.vdnssearch = dnssearch;
> +cfg.vdomainname = vdomainname;
> +s->slirp = slirp_new(&cfg, &slirp_cb, s);
>  QTAILQ_INSERT_TAIL(&slirp_stacks, s, entry);
>  
>  /*
> -- 
> 2.35.1
> 
> 

-- 
Samuel
---
For an independent, transparent and rigorous evaluation!
I support Inria's Evaluation Committee.



Re: [RFC PATCH 4/4] net: slirp: allow CFI with libslirp >= 4.7

2022-04-26 Thread Samuel Thibault
Paolo Bonzini, on Tue, 12 April 2022 at 14:13:37 +0200, wrote:
> slirp 4.7 introduces a new CFI-friendly timer callback that does
> not pass function pointers within libslirp as callbacks for timers.
> Check the version number and, if it is new enough, allow using CFI
> even with a system libslirp.
> 
> Signed-off-by: Paolo Bonzini 

Reviewed-by: Samuel Thibault 

> ---
>  meson.build | 24 
>  1 file changed, 12 insertions(+), 12 deletions(-)
> 
> diff --git a/meson.build b/meson.build
> index 861de93c4f..92a83580a3 100644
> --- a/meson.build
> +++ b/meson.build
> @@ -2485,21 +2485,21 @@ if have_system
>  slirp = declare_dependency(link_with: libslirp,
> dependencies: slirp_deps,
> include_directories: slirp_inc)
> +  else
> +# slirp <4.7 is incompatible with CFI support in QEMU.  This is because
> +# it passes function pointers within libslirp as callbacks for timers.
> +# When using a system-wide shared libslirp, the type information for the
> +# callback is missing and the timer call produces a false positive with 
> CFI.
> +#
> +# Now that slirp_opt has been defined, check if the selected slirp is 
> compatible
> +# with control-flow integrity.
> +if get_option('cfi') and slirp.found() and 
> slirp.version().version_compare('<4.7')
> +  error('Control-Flow Integrity is not compatible with system-wide 
> slirp.' \
> + + ' Please configure with --enable-slirp=git or upgrade to 
> libslirp 4.7')
> +endif
>endif
>  endif
>  
> -# For CFI, we need to compile slirp as a static library together with qemu.
> -# This is because we register slirp functions as callbacks for QEMU Timers.
> -# When using a system-wide shared libslirp, the type information for the
> -# callback is missing and the timer call produces a false positive with CFI.
> -#
> -# Now that slirp_opt has been defined, check if the selected slirp is 
> compatible
> -# with control-flow integrity.
> -if get_option('cfi') and slirp_opt == 'system'
> -  error('Control-Flow Integrity is not compatible with system-wide slirp.' \
> - + ' Please configure with --enable-slirp=git')
> -endif
> -
>  fdt = not_found
>  if have_system
>fdt_opt = get_option('fdt')
> -- 
> 2.35.1
> 

-- 
Samuel
---
For an independent, transparent and rigorous evaluation!
I support Inria's Evaluation Committee.



[PATCH v2 3/5] hw/intc: Support migration of aclint device

2022-04-26 Thread Atish Patra
mtimecmp is part of ACLINT device now. This needs to preserved across
migration.

Signed-off-by: Atish Patra 
---
 hw/intc/riscv_aclint.c | 14 ++
 1 file changed, 14 insertions(+)

diff --git a/hw/intc/riscv_aclint.c b/hw/intc/riscv_aclint.c
index 1bddb99bda47..0bbc5e65737e 100644
--- a/hw/intc/riscv_aclint.c
+++ b/hw/intc/riscv_aclint.c
@@ -32,6 +32,7 @@
 #include "hw/intc/riscv_aclint.h"
 #include "qemu/timer.h"
 #include "hw/irq.h"
+#include "migration/vmstate.h"
 
 typedef struct riscv_aclint_mtimer_callback {
 RISCVAclintMTimerState *s;
@@ -311,6 +312,18 @@ static void riscv_aclint_mtimer_reset_enter(Object *obj, 
ResetType type)
 riscv_aclint_mtimer_write(mtimer, mtimer->time_base, 0, 8);
 }
 
+static const VMStateDescription vmstate_riscv_mtimer = {
+.name = "riscv_mtimer",
+.version_id = 1,
+.minimum_version_id = 1,
+.fields = (VMStateField[]) {
+VMSTATE_VARRAY_UINT64(timecmp, RISCVAclintMTimerState,
+  num_harts, 0,
+  vmstate_info_uint64, uint64_t),
+VMSTATE_END_OF_LIST()
+}
+};
+
 static void riscv_aclint_mtimer_class_init(ObjectClass *klass, void *data)
 {
 DeviceClass *dc = DEVICE_CLASS(klass);
@@ -318,6 +331,7 @@ static void riscv_aclint_mtimer_class_init(ObjectClass 
*klass, void *data)
 device_class_set_props(dc, riscv_aclint_mtimer_properties);
 ResettableClass *rc = RESETTABLE_CLASS(klass);
 rc->phases.enter = riscv_aclint_mtimer_reset_enter;
+dc->vmsd = &vmstate_riscv_mtimer;
 }
 
 static const TypeInfo riscv_aclint_mtimer_info = {
-- 
2.25.1




[PATCH v2 0/5] Implement Sstc extension

2022-04-26 Thread Atish Patra
This series implements the Sstc extension[1], which was ratified recently.

The first patch is a preparatory patch, PATCH 2 adds stimecmp
support, and PATCH 3 adds vstimecmp support. This series is based
on top of upstream commit (faee5441a038).

The series can also be found at
https://github.com/atishp04/qemu/tree/sstc_v2

It is tested on RV32 & RV64 with additional OpenSBI[2] & Linux kernel[3]
patches.

Changes from v1->v2:
1. Rebased on the latest upstream commit.
2. Replaced PATCH 1 with another patch where mtimer/timecmp is
   moved from CPU to ACLINT.
3. Added ACLINT migration support.

[1] https://drive.google.com/file/d/1m84Re2yK8m_vbW7TspvevCDR82MOBaSX/view
[2] https://github.com/atishp04/opensbi/tree/sstc_v2
[3] https://github.com/atishp04/linux/tree/sstc_v3

Atish Patra (5):
hw/intc: Move mtimer/mtimecmp to aclint
migration: Add 64bit variable array data type
hw/intc: Support migration of aclint device
target/riscv: Add stimecmp support
target/riscv: Add vstimecmp support

hw/intc/riscv_aclint.c |  41 +---
hw/timer/ibex_timer.c  |  20 ++--
include/hw/intc/riscv_aclint.h |   2 +
include/hw/timer/ibex_timer.h  |   2 +
include/migration/vmstate.h|  11 ++
migration/vmstate.c|   2 +
target/riscv/cpu.c |   8 ++
target/riscv/cpu.h |  10 +-
target/riscv/cpu_bits.h|   8 ++
target/riscv/cpu_helper.c  |  11 +-
target/riscv/csr.c | 181 +
target/riscv/machine.c |   3 +-
target/riscv/meson.build   |   3 +-
target/riscv/time_helper.c | 114 +
target/riscv/time_helper.h |  30 ++
15 files changed, 415 insertions(+), 31 deletions(-)
create mode 100644 target/riscv/time_helper.c
create mode 100644 target/riscv/time_helper.h

--
2.25.1




[PATCH v10 6/7] multifd: Send header packet without flags if zero-copy-send is enabled

2022-04-26 Thread Leonardo Bras
Since d48c3a0445 ("multifd: Use a single writev on the send side"),
sending the header packet and the memory pages happens in the same
writev, which can potentially make the migration faster.

Using channel-socket as example, this works well with the default copying
mechanism of sendmsg(), but with zero-copy-send=true, it will cause
the migration to often break.

This happens because the header packet buffer gets reused quite often,
and there is a high chance that by the time the MSG_ZEROCOPY mechanism get
to send the buffer, it has already changed, sending the wrong data and
causing the migration to abort.

It means that, as it is, the buffer for the header packet is not suitable
for sending with MSG_ZEROCOPY.

In order to enable zero copy for multifd, send the header packet on an
individual write(), without any flags, and the remaining pages with a
writev(), as it was happening before. This only changes how a migration
with zero-copy-send=true works, not changing any current behavior for
migrations with zero-copy-send=false.

Signed-off-by: Leonardo Bras 
---
 migration/multifd.c | 23 ---
 1 file changed, 20 insertions(+), 3 deletions(-)

diff --git a/migration/multifd.c b/migration/multifd.c
index 15fb668e64..07b2e92d8d 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -617,6 +617,7 @@ static void *multifd_send_thread(void *opaque)
 MultiFDSendParams *p = opaque;
 Error *local_err = NULL;
 int ret = 0;
+bool use_zero_copy_send = migrate_use_zero_copy_send();
 
 trace_multifd_send_thread_start(p->id);
 rcu_register_thread();
@@ -639,9 +640,14 @@ static void *multifd_send_thread(void *opaque)
 if (p->pending_job) {
 uint64_t packet_num = p->packet_num;
 uint32_t flags = p->flags;
-p->iovs_num = 1;
 p->normal_num = 0;
 
+if (use_zero_copy_send) {
+p->iovs_num = 0;
+} else {
+p->iovs_num = 1;
+}
+
 for (int i = 0; i < p->pages->num; i++) {
 p->normal[p->normal_num] = p->pages->offset[i];
 p->normal_num++;
@@ -665,8 +671,19 @@ static void *multifd_send_thread(void *opaque)
 trace_multifd_send(p->id, packet_num, p->normal_num, flags,
p->next_packet_size);
 
-p->iov[0].iov_len = p->packet_len;
-p->iov[0].iov_base = p->packet;
+if (use_zero_copy_send) {
+/* Send header first, without zerocopy */
+ret = qio_channel_write_all(p->c, (void *)p->packet,
+p->packet_len, &local_err);
+if (ret != 0) {
+break;
+}
+
+} else {
+/* Send header using the same writev call */
+p->iov[0].iov_len = p->packet_len;
+p->iov[0].iov_base = p->packet;
+}
 
 ret = qio_channel_writev_all(p->c, p->iov, p->iovs_num,
  &local_err);
-- 
2.36.0




[PATCH v2 5/5] target/riscv: Add vstimecmp support

2022-04-26 Thread Atish Patra
The vstimecmp CSR allows the guest OS to program the next guest timer
interrupt directly. Thus, the hypervisor no longer needs to inject the
timer interrupt into the guest if vstimecmp is used. This was ratified
as a part of the Sstc extension.
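
The net effect on interrupt delivery can be seen in riscv_cpu_all_pending()
in the diff below, which now ORs in VSTIP from the internal VS-timer state
instead of relying on mip alone:

    uint64_t vstip = (env->vstime_irq) ? MIP_VSTIP : 0;
    return (env->mip | vsgein | vstip) & env->mie;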

Signed-off-by: Atish Patra 
---
 target/riscv/cpu.h |   3 ++
 target/riscv/cpu_bits.h|   4 ++
 target/riscv/cpu_helper.c  |  11 ++--
 target/riscv/csr.c | 102 -
 target/riscv/machine.c |   1 +
 target/riscv/time_helper.c |  16 ++
 6 files changed, 132 insertions(+), 5 deletions(-)

diff --git a/target/riscv/cpu.h b/target/riscv/cpu.h
index 9a5e02f217ba..29d8ab1aaca6 100644
--- a/target/riscv/cpu.h
+++ b/target/riscv/cpu.h
@@ -278,6 +278,7 @@ struct CPUArchState {
 
 /* Sstc CSRs */
 uint64_t stimecmp;
+uint64_t vstimecmp;
 
 /* physical memory protection */
 pmp_table_t pmp_state;
@@ -333,6 +334,8 @@ struct CPUArchState {
 
 /* Fields from here on are preserved across CPU reset. */
 QEMUTimer *stimer; /* Internal timer for S-mode interrupt */
+QEMUTimer *vstimer; /* Internal timer for VS-mode interrupt */
+bool vstime_irq;
 
 hwaddr kernel_addr;
 hwaddr fdt_addr;
diff --git a/target/riscv/cpu_bits.h b/target/riscv/cpu_bits.h
index 631b7c32b38f..ce33bf8befbb 100644
--- a/target/riscv/cpu_bits.h
+++ b/target/riscv/cpu_bits.h
@@ -269,6 +269,10 @@
 #define CSR_VSIP0x244
 #define CSR_VSATP   0x280
 
+/* Sstc virtual CSRs */
+#define CSR_VSTIMECMP   0x24D
+#define CSR_VSTIMECMPH  0x25D
+
 #define CSR_MTINST  0x34a
 #define CSR_MTVAL2  0x34b
 
diff --git a/target/riscv/cpu_helper.c b/target/riscv/cpu_helper.c
index e1aa4f2097c1..2715021c022e 100644
--- a/target/riscv/cpu_helper.c
+++ b/target/riscv/cpu_helper.c
@@ -344,8 +344,9 @@ static uint64_t riscv_cpu_all_pending(CPURISCVState *env)
 {
 uint32_t gein = get_field(env->hstatus, HSTATUS_VGEIN);
 uint64_t vsgein = (env->hgeip & (1ULL << gein)) ? MIP_VSEIP : 0;
+uint64_t vstip = (env->vstime_irq) ? MIP_VSTIP : 0;
 
-return (env->mip | vsgein) & env->mie;
+return (env->mip | vsgein | vstip) & env->mie;
 }
 
 int riscv_cpu_mirq_pending(CPURISCVState *env)
@@ -604,7 +605,7 @@ uint64_t riscv_cpu_update_mip(RISCVCPU *cpu, uint64_t mask, 
uint64_t value)
 {
 CPURISCVState *env = &cpu->env;
 CPUState *cs = CPU(cpu);
-uint64_t gein, vsgein = 0, old = env->mip;
+uint64_t gein, vsgein = 0, vstip = 0, old = env->mip;
 bool locked = false;
 
 if (riscv_cpu_virt_enabled(env)) {
@@ -612,6 +613,10 @@ uint64_t riscv_cpu_update_mip(RISCVCPU *cpu, uint64_t 
mask, uint64_t value)
 vsgein = (env->hgeip & (1ULL << gein)) ? MIP_VSEIP : 0;
 }
 
+/* No need to update mip for VSTIP */
+mask = ((mask == MIP_VSTIP) && env->vstime_irq) ? 0 : mask;
+vstip = env->vstime_irq ? MIP_VSTIP : 0;
+
 if (!qemu_mutex_iothread_locked()) {
 locked = true;
 qemu_mutex_lock_iothread();
@@ -619,7 +624,7 @@ uint64_t riscv_cpu_update_mip(RISCVCPU *cpu, uint64_t mask, 
uint64_t value)
 
 env->mip = (env->mip & ~mask) | (value & mask);
 
-if (env->mip | vsgein) {
+if (env->mip | vsgein | vstip) {
 cpu_interrupt(cs, CPU_INTERRUPT_HARD);
 } else {
 cpu_reset_interrupt(cs, CPU_INTERRUPT_HARD);
diff --git a/target/riscv/csr.c b/target/riscv/csr.c
index c51c05d2ea74..c96c12a8ac74 100644
--- a/target/riscv/csr.c
+++ b/target/riscv/csr.c
@@ -567,17 +567,100 @@ static RISCVException sstc(CPURISCVState *env, int csrno)
 return RISCV_EXCP_NONE;
 }
 
+static RISCVException sstc_hmode(CPURISCVState *env, int csrno)
+{
+CPUState *cs = env_cpu(env);
+RISCVCPU *cpu = RISCV_CPU(cs);
+
+if (!cpu->cfg.ext_sstc || !env->rdtime_fn) {
+return RISCV_EXCP_ILLEGAL_INST;
+}
+
+if (env->priv == PRV_M) {
+return RISCV_EXCP_NONE;
+}
+
+if (!(get_field(env->mcounteren, COUNTEREN_TM) &
+  get_field(env->menvcfg, MENVCFG_STCE))) {
+return RISCV_EXCP_ILLEGAL_INST;
+}
+
+if (!(get_field(env->hcounteren, COUNTEREN_TM) &
+  get_field(env->henvcfg, HENVCFG_STCE))) {
+return RISCV_EXCP_VIRT_INSTRUCTION_FAULT;
+}
+
+return RISCV_EXCP_NONE;
+}
+
+static RISCVException read_vstimecmp(CPURISCVState *env, int csrno,
+target_ulong *val)
+{
+*val = env->vstimecmp;
+
+return RISCV_EXCP_NONE;
+}
+
+static RISCVException read_vstimecmph(CPURISCVState *env, int csrno,
+target_ulong *val)
+{
+*val = env->vstimecmp >> 32;
+
+return RISCV_EXCP_NONE;
+}
+
+static RISCVException write_vstimecmp(CPURISCVState *env, int csrno,
+target_ulong val)
+{
+RISCVCPU *cpu = env_archcpu(env);
+
+if (riscv_cpu_mxl(env) == MXL_RV32) {
+uint64_t vstimecmp_hi = env->vstimecmp >> 32;
+env->vstimecmp = (vstimecmp_hi << 32) | (val & 0xF

[PATCH v2 2/5] migration: Add 64bit variable array data type

2022-04-26 Thread Atish Patra
Support for the unsigned 64-bit variable array data type is missing.
Add the required code to support it.
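
A hypothetical usage sketch (FooState, values and num_values are
illustrative names, not from this series): migrate a heap-allocated
uint64_t array whose element count lives in a uint64_t field:

    static const VMStateDescription vmstate_foo = {
        .name = "foo",
        .version_id = 1,
        .fields = (VMStateField[]) {
            VMSTATE_VARRAY_UINT64(values, FooState, num_values, 1,
                                  vmstate_info_uint64, uint64_t),
            VMSTATE_END_OF_LIST()
        }
    };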

Signed-off-by: Atish Patra 
---
 include/migration/vmstate.h | 11 +++
 migration/vmstate.c |  2 ++
 2 files changed, 13 insertions(+)

diff --git a/include/migration/vmstate.h b/include/migration/vmstate.h
index ad24aa193451..d289caea0143 100644
--- a/include/migration/vmstate.h
+++ b/include/migration/vmstate.h
@@ -147,6 +147,7 @@ enum VMStateFlags {
  * VMStateField.struct_version_id to tell which version of the
  * structure we are referencing to use. */
 VMS_VSTRUCT   = 0x8000,
+VMS_VARRAY_UINT64 = 0x1,
 };
 
 typedef enum {
@@ -428,6 +429,16 @@ extern const VMStateInfo vmstate_info_qlist;
 .offset = vmstate_offset_pointer(_state, _field, _type), \
 }
 
+#define VMSTATE_VARRAY_UINT64(_field, _state, _field_num, _version, _info, 
_type) {\
+.name   = (stringify(_field)),   \
+.version_id = (_version),\
+.num_offset = vmstate_offset_value(_state, _field_num, uint64_t),\
+.info   = &(_info),  \
+.size   = sizeof(_type), \
+.flags  = VMS_VARRAY_UINT64 | VMS_POINTER,   \
+.offset = vmstate_offset_pointer(_state, _field, _type), \
+}
+
 #define VMSTATE_VARRAY_UINT16_ALLOC(_field, _state, _field_num, _version, 
_info, _type) {\
 .name   = (stringify(_field)),   \
 .version_id = (_version),\
diff --git a/migration/vmstate.c b/migration/vmstate.c
index 36ae8b9e1918..3cd5e37ebe2d 100644
--- a/migration/vmstate.c
+++ b/migration/vmstate.c
@@ -35,6 +35,8 @@ static int vmstate_n_elems(void *opaque, const VMStateField 
*field)
 n_elems = *(int32_t *)(opaque + field->num_offset);
 } else if (field->flags & VMS_VARRAY_UINT32) {
 n_elems = *(uint32_t *)(opaque + field->num_offset);
+} else if (field->flags & VMS_VARRAY_UINT64) {
+n_elems = *(uint64_t *)(opaque + field->num_offset);
 } else if (field->flags & VMS_VARRAY_UINT16) {
 n_elems = *(uint16_t *)(opaque + field->num_offset);
 } else if (field->flags & VMS_VARRAY_UINT8) {
-- 
2.25.1




[PATCH v10 4/7] migration: Add migrate_use_tls() helper

2022-04-26 Thread Leonardo Bras
A lot of places check parameters.tls_creds in order to evaluate if TLS is
in use, and sometimes call migrate_get_current() just for that test.

Add a new helper function migrate_use_tls() in order to simplify testing
for TLS usage.

Signed-off-by: Leonardo Bras 
Reviewed-by: Juan Quintela 
Reviewed-by: Peter Xu 
Reviewed-by: Daniel P. Berrangé 
---
 migration/migration.h | 1 +
 migration/channel.c   | 3 +--
 migration/migration.c | 9 +
 migration/multifd.c   | 5 +
 4 files changed, 12 insertions(+), 6 deletions(-)

diff --git a/migration/migration.h b/migration/migration.h
index e8f2941a55..485d58b95f 100644
--- a/migration/migration.h
+++ b/migration/migration.h
@@ -380,6 +380,7 @@ bool migrate_use_zero_copy_send(void);
 #else
 #define migrate_use_zero_copy_send() (false)
 #endif
+int migrate_use_tls(void);
 int migrate_use_xbzrle(void);
 uint64_t migrate_xbzrle_cache_size(void);
 bool migrate_colo_enabled(void);
diff --git a/migration/channel.c b/migration/channel.c
index c6a8dcf1d7..a162d00fea 100644
--- a/migration/channel.c
+++ b/migration/channel.c
@@ -38,8 +38,7 @@ void migration_channel_process_incoming(QIOChannel *ioc)
 trace_migration_set_incoming_channel(
 ioc, object_get_typename(OBJECT(ioc)));
 
-if (s->parameters.tls_creds &&
-*s->parameters.tls_creds &&
+if (migrate_use_tls() &&
 !object_dynamic_cast(OBJECT(ioc),
  TYPE_QIO_CHANNEL_TLS)) {
 migration_tls_channel_process_incoming(s, ioc, &local_err);
diff --git a/migration/migration.c b/migration/migration.c
index 3e91f4b5e2..4b6df2eb5e 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -2588,6 +2588,15 @@ bool migrate_use_zero_copy_send(void)
 }
 #endif
 
+int migrate_use_tls(void)
+{
+MigrationState *s;
+
+s = migrate_get_current();
+
+return s->parameters.tls_creds && *s->parameters.tls_creds;
+}
+
 int migrate_use_xbzrle(void)
 {
 MigrationState *s;
diff --git a/migration/multifd.c b/migration/multifd.c
index 9ea4f581e2..2a8c8570c3 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -782,15 +782,12 @@ static bool multifd_channel_connect(MultiFDSendParams *p,
 QIOChannel *ioc,
 Error *error)
 {
-MigrationState *s = migrate_get_current();
-
 trace_multifd_set_outgoing_channel(
 ioc, object_get_typename(OBJECT(ioc)),
 migrate_get_current()->hostname, error);
 
 if (!error) {
-if (s->parameters.tls_creds &&
-*s->parameters.tls_creds &&
+if (migrate_use_tls() &&
 !object_dynamic_cast(OBJECT(ioc),
  TYPE_QIO_CHANNEL_TLS)) {
 multifd_tls_channel_connect(p, ioc, &error);
-- 
2.36.0




Re: [RFC PATCH 1/4] net: slirp: introduce a wrapper struct for QemuTimer

2022-04-26 Thread Samuel Thibault
Paolo Bonzini, on Tue, 12 April 2022 14:13:34 +0200, wrote:
> This struct will be extended in the next few patches to support the
> new slirp_handle_timer() call.  For that we need to store an additional
> "int" for each SLIRP timer, in addition to the cb_opaque.
> 
> Signed-off-by: Paolo Bonzini 

Reviewed-by: Samuel Thibault 

> ---
>  net/slirp.c | 19 ++-
>  1 file changed, 14 insertions(+), 5 deletions(-)
> 
> diff --git a/net/slirp.c b/net/slirp.c
> index bc5e9e4f77..f1e25d741f 100644
> --- a/net/slirp.c
> +++ b/net/slirp.c
> @@ -184,23 +184,32 @@ static int64_t net_slirp_clock_get_ns(void *opaque)
>  return qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL);
>  }
>  
> +typedef struct SlirpTimer {
> +QEMUTimer timer;
> +} SlirpTimer;
> +
>  static void *net_slirp_timer_new(SlirpTimerCb cb,
>   void *cb_opaque, void *opaque)
>  {
> -return timer_new_full(NULL, QEMU_CLOCK_VIRTUAL,
> -  SCALE_MS, QEMU_TIMER_ATTR_EXTERNAL,
> -  cb, cb_opaque);
> +SlirpTimer *t = g_new(SlirpTimer, 1);
> +timer_init_full(&t->timer, NULL, QEMU_CLOCK_VIRTUAL,
> +SCALE_MS, QEMU_TIMER_ATTR_EXTERNAL,
> +cb, cb_opaque);
> +return t;
>  }
>  
>  static void net_slirp_timer_free(void *timer, void *opaque)
>  {
> -timer_free(timer);
> +SlirpTimer *t = timer;
> +timer_del(&t->timer);
> +g_free(t);
>  }
>  
>  static void net_slirp_timer_mod(void *timer, int64_t expire_timer,
>  void *opaque)
>  {
> -timer_mod(timer, expire_timer);
> +SlirpTimer *t = timer;
> +timer_mod(&t->timer, expire_timer);
>  }
>  
>  static void net_slirp_register_poll_fd(int fd, void *opaque)
> -- 
> 2.35.1
> 
> 

-- 
Samuel
---
For an independent, transparent and rigorous evaluation!
I support Inria's Evaluation Committee.



[PATCH v2 1/5] hw/intc: Move mtimer/mtimecmp to aclint

2022-04-26 Thread Atish Patra
Historically, the mtimer/mtimecmp state has been part of the CPU because
these are per-hart entities. However, they actually belong to the ACLINT,
which is an MMIO device.

Move them to the ACLINT device. This also emulates the real hardware
more closely.

Signed-off-by: Atish Patra 
---
 hw/intc/riscv_aclint.c | 27 +++
 hw/timer/ibex_timer.c  | 20 
 include/hw/intc/riscv_aclint.h |  2 ++
 include/hw/timer/ibex_timer.h  |  2 ++
 target/riscv/cpu.h |  2 --
 target/riscv/machine.c |  1 -
 6 files changed, 27 insertions(+), 27 deletions(-)

diff --git a/hw/intc/riscv_aclint.c b/hw/intc/riscv_aclint.c
index 0412edc98257..1bddb99bda47 100644
--- a/hw/intc/riscv_aclint.c
+++ b/hw/intc/riscv_aclint.c
@@ -65,8 +65,8 @@ static void 
riscv_aclint_mtimer_write_timecmp(RISCVAclintMTimerState *mtimer,
 
 uint64_t rtc_r = cpu_riscv_read_rtc(mtimer);
 
-cpu->env.timecmp = value;
-if (cpu->env.timecmp <= rtc_r) {
+mtimer->timecmp[hartid] = value;
+if (mtimer->timecmp[hartid] <= rtc_r) {
 /*
  * If we're setting an MTIMECMP value in the "past",
  * immediately raise the timer interrupt
@@ -77,7 +77,7 @@ static void 
riscv_aclint_mtimer_write_timecmp(RISCVAclintMTimerState *mtimer,
 
 /* otherwise, set up the future timer interrupt */
 qemu_irq_lower(mtimer->timer_irqs[hartid - mtimer->hartid_base]);
-diff = cpu->env.timecmp - rtc_r;
+diff = mtimer->timecmp[hartid] - rtc_r;
 /* back to ns (note args switched in muldiv64) */
 uint64_t ns_diff = muldiv64(diff, NANOSECONDS_PER_SECOND, timebase_freq);
 
@@ -102,7 +102,7 @@ static void 
riscv_aclint_mtimer_write_timecmp(RISCVAclintMTimerState *mtimer,
 next = MIN(next, INT64_MAX);
 }
 
-timer_mod(cpu->env.timer, next);
+timer_mod(mtimer->timers[hartid], next);
 }
 
 /*
@@ -133,11 +133,11 @@ static uint64_t riscv_aclint_mtimer_read(void *opaque, 
hwaddr addr,
   "aclint-mtimer: invalid hartid: %zu", hartid);
 } else if ((addr & 0x7) == 0) {
 /* timecmp_lo for RV32/RV64 or timecmp for RV64 */
-uint64_t timecmp = env->timecmp;
+uint64_t timecmp = mtimer->timecmp[hartid];
 return (size == 4) ? (timecmp & 0x) : timecmp;
 } else if ((addr & 0x7) == 4) {
 /* timecmp_hi */
-uint64_t timecmp = env->timecmp;
+uint64_t timecmp = mtimer->timecmp[hartid];
 return (timecmp >> 32) & 0x;
 } else {
 qemu_log_mask(LOG_UNIMP,
@@ -177,7 +177,7 @@ static void riscv_aclint_mtimer_write(void *opaque, hwaddr 
addr,
 } else if ((addr & 0x7) == 0) {
 if (size == 4) {
 /* timecmp_lo for RV32/RV64 */
-uint64_t timecmp_hi = env->timecmp >> 32;
+uint64_t timecmp_hi = mtimer->timecmp[hartid] >> 32;
 riscv_aclint_mtimer_write_timecmp(mtimer, RISCV_CPU(cpu), 
hartid,
 timecmp_hi << 32 | (value & 0x));
 } else {
@@ -188,7 +188,7 @@ static void riscv_aclint_mtimer_write(void *opaque, hwaddr 
addr,
 } else if ((addr & 0x7) == 4) {
 if (size == 4) {
 /* timecmp_hi for RV32/RV64 */
-uint64_t timecmp_lo = env->timecmp;
+uint64_t timecmp_lo = mtimer->timecmp[hartid];
 riscv_aclint_mtimer_write_timecmp(mtimer, RISCV_CPU(cpu), 
hartid,
 value << 32 | (timecmp_lo & 0x));
 } else {
@@ -233,7 +233,7 @@ static void riscv_aclint_mtimer_write(void *opaque, hwaddr 
addr,
 continue;
 }
 riscv_aclint_mtimer_write_timecmp(mtimer, RISCV_CPU(cpu),
-  i, env->timecmp);
+  i, mtimer->timecmp[i]);
 }
 return;
 }
@@ -283,6 +283,8 @@ static void riscv_aclint_mtimer_realize(DeviceState *dev, 
Error **errp)
 s->timer_irqs = g_new(qemu_irq, s->num_harts);
 qdev_init_gpio_out(dev, s->timer_irqs, s->num_harts);
 
+s->timers = g_malloc0(s->num_harts * sizeof(QEMUTimer));
+s->timecmp = g_new(uint64_t, s->num_harts);
 /* Claim timer interrupt bits */
 for (i = 0; i < s->num_harts; i++) {
 RISCVCPU *cpu = RISCV_CPU(qemu_get_cpu(s->hartid_base + i));
@@ -335,6 +337,7 @@ DeviceState *riscv_aclint_mtimer_create(hwaddr addr, hwaddr 
size,
 {
 int i;
 DeviceState *dev = qdev_new(TYPE_RISCV_ACLINT_MTIMER);
+RISCVAclintMTimerState *s = RISCV_ACLINT_MTIMER(dev);
 
 assert(num_harts <= RISCV_ACLINT_MAX_HARTS);
 assert(!(addr & 0x7));
@@ -365,11 +368,11 @@ DeviceState *riscv_aclint_mtimer_create(hwaddr addr, 
hwaddr size,
 riscv_cpu_set_rdtime_fn(env, cpu_riscv_read_rtc, dev);
 }
 
-cb->s = RISCV_ACLINT_MTIMER(dev);
+cb->s = s;
 cb->num = i;
-e

Re: [RFC 15/18] vfio/iommufd: Implement iommufd backend

2022-04-26 Thread Jason Gunthorpe
On Tue, Apr 26, 2022 at 02:59:31PM -0600, Alex Williamson wrote:

> > The best you could do is make a dummy IOAS then attach the device,
> > read the mappings, detach, and then do your unmaps.
> 
> Right, the same thing the kernel does currently.
> 
> > I'm imagining something like IOMMUFD_DEVICE_GET_RANGES that can be
> > called prior to attaching on the device ID.
> 
> Something like /sys/kernel/iommu_groups/$GROUP/reserved_regions?

If we do the above ioctl with iommufd I would want to include the domain
aperture too, but yes.

> > > We must be absolutely certain that there is no DMA to that range
> > > before doing so.  
> > 
> > Yes, but at the same time if the VM thinks it can DMA to that memory
> > then it is quite likely to DMA to it with the new device that doesn't
> > have it mapped in the first place.
> 
> Sorry, this assertion doesn't make sense to me.  We can't assume a
> vIOMMU on x86, so QEMU typically maps the entire VM address space (ie.
> device address space == system memory).  Some of those mappings are
> likely DMA targets (RAM), but only a tiny fraction of the address space
> may actually be used for DMA.  Some of those mappings are exceedingly
> unlikely P2P DMA targets (device memory), so we don't consider mapping
> failures to be fatal to attaching the device.

> If we have a case where a range failed for one device but worked for a
> previous, we're in the latter scenario, because we should have failed
> the device attach otherwise.  Your assertion would require that there
> are existing devices (plural) making use of this mapping and that the
> new device is also likely to make use of this mapping.  I have a hard
> time believing that evidence exists to support that statement.

This is quite normal, we often have multiple NICs and GPUs in the same
system/VM and the expectation is that P2P between the MMIO regions of
all the NICs and all the GPUs will work. Hotplugging in a NIC or GPU
and having it be excluded from P2P maps would be fatal to the VM.

So, while I think it is vanishingly unlikely that a reserved region
conflict would cause a problem, my preference is that this stuff is
deterministic. Either hotplug fails or hotplug configures it to the
same state it would be if the VM was started with this configuration.

Perhaps this just suggests that qemu should be told by the operator
what kind of P2P to export from a device 'never/auto/always' with auto
being today's behavior.

> P2P use cases are sufficiently rare that this hasn't been an issue.  I
> think there's also still a sufficiently healthy dose of FUD about whether a
> system supports P2P that drivers do some validation before relying on
> it.

I'm not sure what you mean here, the P2P capability discovery is a
complete mess and never did get standardized. Linux has the
expectation that drivers will use pci_p2pdma_distance() before doing
P2P which weeds out only some of the worst non-working cases.

> > This is why I find it a bit strange that qemu doesn't check the
> > ranges. eg I would expect that anything declared as memory in the E820
> > map has to be mappable to the iommu_domain or the device should not
> > attach at all.
> 
> You have some interesting assumptions around associating
> MemoryRegionSegments from the device AddressSpace to something like an
> x86 specific E820 table.  

I'm thinking about it from an OS perspective in the VM, not from qemu
internals. OSes do not randomly DMA everywhere, the firmware tables/etc
do make it predictable where DMA will happen.

> > The P2P is a bit trickier, and I know we don't have a good story
> > because we lack ACPI description, but I would have expected the same
> > kind of thing. Anything P2Pable should be in the iommu_domain or the
> > device should not attach. As with system memory there are only certain
> > parts of the E820 map that an OS would use for P2P.
> > 
> > (ideally ACPI would indicate exactly what combinations of devices are
> > P2Pable and then qemu would use that drive the mandatory address
> > ranges in the IOAS)
> 
> How exactly does ACPI indicate that devices can do P2P?  How can we
> rely on ACPI for a problem that's not unique to platforms that
> implement ACPI?

I am trying to say this never did get standardized. It was talked about
when the pci_p2pdma_distance() was merged and I thought some folks
were going to go off and take care of an ACPI query for it to use. It
would be useful here at least.
 
> > > > > yeah. qemu can filter the P2P BAR mapping and just stop it in qemu. We
> > > > > haven't added it as it is something you will add in future. so didn't
> > > > > add it in this RFC. :-) Please let me know if it feels better to 
> > > > > filter
> > > > > it from today.
> > > > 
> > > > I currently hope it will use a different map API entirely and not rely
> > > > on discovering the P2P via the VMA. eg using a DMABUF FD or something.
> > > > 
> > > > So blocking it in qemu feels like the right thing to do.  
> > > 
> > > Wait a sec, so legacy vfi

[PATCH v2 4/5] target/riscv: Add stimecmp support

2022-04-26 Thread Atish Patra
The stimecmp CSR allows supervisor mode to program the next timer
interrupt directly by updating the CSR. This CSR is part of the Sstc
extension, which was ratified recently.
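
For context, a hypothetical guest-side sketch (RV64; assumes an assembler
that knows the stimecmp CSR name, otherwise its number 0x14D can be used):
S-mode programs its next timer interrupt directly, with no SBI/M-mode
round trip:

    static inline void program_next_timer(uint64_t next)
    {
        /* write the absolute time of the next S-mode timer interrupt */
        __asm__ volatile("csrw stimecmp, %0" : : "r"(next));
    }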

Signed-off-by: Atish Patra 
---
 target/riscv/cpu.c |  8 
 target/riscv/cpu.h |  5 ++
 target/riscv/cpu_bits.h|  4 ++
 target/riscv/csr.c | 83 
 target/riscv/machine.c |  1 +
 target/riscv/meson.build   |  3 +-
 target/riscv/time_helper.c | 98 ++
 target/riscv/time_helper.h | 30 
 8 files changed, 231 insertions(+), 1 deletion(-)
 create mode 100644 target/riscv/time_helper.c
 create mode 100644 target/riscv/time_helper.h

diff --git a/target/riscv/cpu.c b/target/riscv/cpu.c
index 0c774056c5c3..6d16aeeacf0c 100644
--- a/target/riscv/cpu.c
+++ b/target/riscv/cpu.c
@@ -23,6 +23,7 @@
 #include "qemu/log.h"
 #include "cpu.h"
 #include "internals.h"
+#include "time_helper.h"
 #include "exec/exec-all.h"
 #include "qapi/error.h"
 #include "qemu/error-report.h"
@@ -777,7 +778,12 @@ static void riscv_cpu_init(Object *obj)
 #ifndef CONFIG_USER_ONLY
 qdev_init_gpio_in(DEVICE(cpu), riscv_cpu_set_irq,
   IRQ_LOCAL_MAX + IRQ_LOCAL_GUEST_MAX);
+
+if (cpu->cfg.ext_sstc) {
+riscv_timer_init(cpu);
+}
 #endif /* CONFIG_USER_ONLY */
+
 }
 
 static Property riscv_cpu_properties[] = {
@@ -804,6 +810,7 @@ static Property riscv_cpu_properties[] = {
 DEFINE_PROP_BOOL("mmu", RISCVCPU, cfg.mmu, true),
 DEFINE_PROP_BOOL("pmp", RISCVCPU, cfg.pmp, true),
 DEFINE_PROP_BOOL("debug", RISCVCPU, cfg.debug, true),
+DEFINE_PROP_BOOL("sstc", RISCVCPU, cfg.ext_sstc, true),
 
 DEFINE_PROP_STRING("priv_spec", RISCVCPU, cfg.priv_spec),
 DEFINE_PROP_STRING("vext_spec", RISCVCPU, cfg.vext_spec),
@@ -963,6 +970,7 @@ static void riscv_isa_string_ext(RISCVCPU *cpu, char 
**isa_str, int max_str_len)
 ISA_EDATA_ENTRY(zbs, ext_zbs),
 ISA_EDATA_ENTRY(zve32f, ext_zve32f),
 ISA_EDATA_ENTRY(zve64f, ext_zve64f),
+ISA_EDATA_ENTRY(sstc, ext_sstc),
 ISA_EDATA_ENTRY(svinval, ext_svinval),
 ISA_EDATA_ENTRY(svnapot, ext_svnapot),
 ISA_EDATA_ENTRY(svpbmt, ext_svpbmt),
diff --git a/target/riscv/cpu.h b/target/riscv/cpu.h
index 1119d5201066..9a5e02f217ba 100644
--- a/target/riscv/cpu.h
+++ b/target/riscv/cpu.h
@@ -276,6 +276,9 @@ struct CPUArchState {
 uint64_t mfromhost;
 uint64_t mtohost;
 
+/* Sstc CSRs */
+uint64_t stimecmp;
+
 /* physical memory protection */
 pmp_table_t pmp_state;
 target_ulong mseccfg;
@@ -329,6 +332,7 @@ struct CPUArchState {
 float_status fp_status;
 
 /* Fields from here on are preserved across CPU reset. */
+QEMUTimer *stimer; /* Internal timer for S-mode interrupt */
 
 hwaddr kernel_addr;
 hwaddr fdt_addr;
@@ -379,6 +383,7 @@ struct RISCVCPUConfig {
 bool ext_counters;
 bool ext_ifencei;
 bool ext_icsr;
+bool ext_sstc;
 bool ext_svinval;
 bool ext_svnapot;
 bool ext_svpbmt;
diff --git a/target/riscv/cpu_bits.h b/target/riscv/cpu_bits.h
index 4a9e4f7d09d9..631b7c32b38f 100644
--- a/target/riscv/cpu_bits.h
+++ b/target/riscv/cpu_bits.h
@@ -212,6 +212,10 @@
 #define CSR_STVAL   0x143
 #define CSR_SIP 0x144
 
+/* Sstc supervisor CSRs */
+#define CSR_STIMECMP0x14D
+#define CSR_STIMECMPH   0x15D
+
 /* Supervisor Protection and Translation */
 #define CSR_SPTBR   0x180
 #define CSR_SATP0x180
diff --git a/target/riscv/csr.c b/target/riscv/csr.c
index 6ba85e7b5da2..c51c05d2ea74 100644
--- a/target/riscv/csr.c
+++ b/target/riscv/csr.c
@@ -21,6 +21,7 @@
 #include "qemu/log.h"
 #include "qemu/timer.h"
 #include "cpu.h"
+#include "time_helper.h"
 #include "qemu/main-loop.h"
 #include "exec/exec-all.h"
 #include "sysemu/cpu-timers.h"
@@ -537,6 +538,78 @@ static RISCVException read_timeh(CPURISCVState *env, int 
csrno,
 return RISCV_EXCP_NONE;
 }
 
+static RISCVException sstc(CPURISCVState *env, int csrno)
+{
+CPUState *cs = env_cpu(env);
+RISCVCPU *cpu = RISCV_CPU(cs);
+
+if (!cpu->cfg.ext_sstc || !env->rdtime_fn) {
+return RISCV_EXCP_ILLEGAL_INST;
+}
+
+if (env->priv == PRV_M) {
+return RISCV_EXCP_NONE;
+}
+
+if (env->priv != PRV_S) {
+return RISCV_EXCP_ILLEGAL_INST;
+}
+
+/*
+ * No need for a separate function for rv32, as menvcfg stores both
+ * menvcfg and menvcfgh for RV32.
+ */
+if (!(get_field(env->mcounteren, COUNTEREN_TM) &&
+  get_field(env->menvcfg, MENVCFG_STCE))) {
+return RISCV_EXCP_ILLEGAL_INST;
+}
+
+return RISCV_EXCP_NONE;
+}
+
+static RISCVException read_stimecmp(CPURISCVState *env, int csrno,
+target_ulong *val)
+{
+*val = env->stimecmp;
+return RISCV_EXCP_NONE;
+}
+
+static RISCVException read_stimecmph(CPURISCVState *env, int csrno,
+ 

[PATCH v10 3/7] migration: Add zero-copy-send parameter for QMP/HMP for Linux

2022-04-26 Thread Leonardo Bras
Add a property that allows zero-copy migration of memory pages
on the sending side, and also include a helper function
migrate_use_zero_copy_send() to check if it's enabled.

No code is introduced to actually do the migration, but it allows
future implementations to enable/disable this feature.

On non-Linux builds this parameter is compiled-out.
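
As a usage sketch, the property can be set via migrate-set-parameters
before starting the migration (the value shown is illustrative):

    { "execute": "migrate-set-parameters",
      "arguments": { "zero-copy-send": true } }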

Signed-off-by: Leonardo Bras 
Reviewed-by: Peter Xu 
Reviewed-by: Daniel P. Berrangé 
Reviewed-by: Juan Quintela 
---
 qapi/migration.json   | 24 
 migration/migration.h |  5 +
 migration/migration.c | 32 
 migration/socket.c| 11 +--
 monitor/hmp-cmds.c|  6 ++
 5 files changed, 76 insertions(+), 2 deletions(-)

diff --git a/qapi/migration.json b/qapi/migration.json
index 409eb086a2..04246481ce 100644
--- a/qapi/migration.json
+++ b/qapi/migration.json
@@ -741,6 +741,13 @@
 #  will consume more CPU.
 #  Defaults to 1. (Since 5.0)
 #
+# @zero-copy-send: Controls behavior on sending memory pages on migration.
+#  When true, enables a zero-copy mechanism for sending memory
+#  pages, if host supports it.
+#  Requires that QEMU be permitted to use locked memory for 
guest
+#  RAM pages.
+#  Defaults to false. (Since 7.1)
+#
 # @block-bitmap-mapping: Maps block nodes and bitmaps on them to
 #aliases for the purpose of dirty bitmap migration.  
Such
 #aliases may for example be the corresponding names on 
the
@@ -780,6 +787,7 @@
'xbzrle-cache-size', 'max-postcopy-bandwidth',
'max-cpu-throttle', 'multifd-compression',
'multifd-zlib-level' ,'multifd-zstd-level',
+   { 'name': 'zero-copy-send', 'if' : 'CONFIG_LINUX'},
'block-bitmap-mapping' ] }
 
 ##
@@ -906,6 +914,13 @@
 #  will consume more CPU.
 #  Defaults to 1. (Since 5.0)
 #
+# @zero-copy-send: Controls behavior on sending memory pages on migration.
+#  When true, enables a zero-copy mechanism for sending memory
+#  pages, if host supports it.
+#  Requires that QEMU be permitted to use locked memory for 
guest
+#  RAM pages.
+#  Defaults to false. (Since 7.1)
+#
 # @block-bitmap-mapping: Maps block nodes and bitmaps on them to
 #aliases for the purpose of dirty bitmap migration.  
Such
 #aliases may for example be the corresponding names on 
the
@@ -960,6 +975,7 @@
 '*multifd-compression': 'MultiFDCompression',
 '*multifd-zlib-level': 'uint8',
 '*multifd-zstd-level': 'uint8',
+'*zero-copy-send': { 'type': 'bool', 'if': 'CONFIG_LINUX' },
 '*block-bitmap-mapping': [ 'BitmapMigrationNodeAlias' ] } }
 
 ##
@@ -1106,6 +1122,13 @@
 #  will consume more CPU.
 #  Defaults to 1. (Since 5.0)
 #
+# @zero-copy-send: Controls behavior on sending memory pages on migration.
+#  When true, enables a zero-copy mechanism for sending memory
+#  pages, if host supports it.
+#  Requires that QEMU be permitted to use locked memory for 
guest
+#  RAM pages.
+#  Defaults to false. (Since 7.1)
+#
 # @block-bitmap-mapping: Maps block nodes and bitmaps on them to
 #aliases for the purpose of dirty bitmap migration.  
Such
 #aliases may for example be the corresponding names on 
the
@@ -1158,6 +1181,7 @@
 '*multifd-compression': 'MultiFDCompression',
 '*multifd-zlib-level': 'uint8',
 '*multifd-zstd-level': 'uint8',
+'*zero-copy-send': { 'type': 'bool', 'if': 'CONFIG_LINUX' },
 '*block-bitmap-mapping': [ 'BitmapMigrationNodeAlias' ] } }
 
 ##
diff --git a/migration/migration.h b/migration/migration.h
index a863032b71..e8f2941a55 100644
--- a/migration/migration.h
+++ b/migration/migration.h
@@ -375,6 +375,11 @@ MultiFDCompression migrate_multifd_compression(void);
 int migrate_multifd_zlib_level(void);
 int migrate_multifd_zstd_level(void);
 
+#ifdef CONFIG_LINUX
+bool migrate_use_zero_copy_send(void);
+#else
+#define migrate_use_zero_copy_send() (false)
+#endif
 int migrate_use_xbzrle(void);
 uint64_t migrate_xbzrle_cache_size(void);
 bool migrate_colo_enabled(void);
diff --git a/migration/migration.c b/migration/migration.c
index 5a31b23bd6..3e91f4b5e2 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -910,6 +910,10 @@ MigrationParameters *qmp_query_migrate_parameters(Error 
**errp)
 params->multifd_zlib_level = s->parameters.multifd_zlib_level;
 params->has_multifd_zstd_level = true;
 params->multifd_zstd_level = s->parameters.multifd_zstd_level;
+#ifdef CONFIG_LINUX
+   

[PATCH v10 7/7] multifd: Implement zero copy write in multifd migration (multifd-zero-copy)

2022-04-26 Thread Leonardo Bras
Implement zero copy send on nocomp_send_write(), making use of the
QIOChannel writev + flags & flush interface.

Change multifd_send_sync_main() so flush_zero_copy() can be called
after each iteration in order to make sure all dirty pages are sent before
a new iteration is started. It will also flush at the beginning and at the
end of migration.

Also make it return -1 if flush_zero_copy() fails, in order to cancel
the migration process, and avoid resuming the guest on the target host
without receiving all current RAM.

This will work fine on RAM migration because the RAM pages are not
usually freed, and there is no problem in changing the page contents
between writev_zero_copy() and the actual sending of the buffer, because
any such change will dirty the page and cause it to be re-sent in a
later iteration anyway.

A lot of locked memory may be needed in order to use multifd migration
with zero-copy enabled, so the ability to disable the feature is
necessary for low-privileged users trying to perform multifd migrations.

Signed-off-by: Leonardo Bras 
---
 migration/multifd.h   |  2 ++
 migration/migration.c | 11 ++-
 migration/multifd.c   | 37 +++--
 migration/socket.c|  5 +++--
 4 files changed, 50 insertions(+), 5 deletions(-)

diff --git a/migration/multifd.h b/migration/multifd.h
index bcf5992945..4d8d89e5e5 100644
--- a/migration/multifd.h
+++ b/migration/multifd.h
@@ -92,6 +92,8 @@ typedef struct {
 uint32_t packet_len;
 /* pointer to the packet */
 MultiFDPacket_t *packet;
+/* multifd flags for sending ram */
+int write_flags;
 /* multifd flags for each packet */
 uint32_t flags;
 /* size of the next packet that contains pages */
diff --git a/migration/migration.c b/migration/migration.c
index 4b6df2eb5e..31739b2af9 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -1497,7 +1497,16 @@ static bool migrate_params_check(MigrationParameters 
*params, Error **errp)
 error_prepend(errp, "Invalid mapping given for block-bitmap-mapping: 
");
 return false;
 }
-
+#ifdef CONFIG_LINUX
+if (params->zero_copy_send &&
+(!migrate_use_multifd() ||
+ params->multifd_compression != MULTIFD_COMPRESSION_NONE ||
+ (params->tls_creds && *params->tls_creds))) {
+error_setg(errp,
+   "Zero copy only available for non-compressed non-TLS 
multifd migration");
+return false;
+}
+#endif
 return true;
 }
 
diff --git a/migration/multifd.c b/migration/multifd.c
index 07b2e92d8d..00b4040eca 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -569,6 +569,7 @@ void multifd_save_cleanup(void)
 int multifd_send_sync_main(QEMUFile *f)
 {
 int i;
+bool flush_zero_copy;
 
 if (!migrate_use_multifd()) {
 return 0;
@@ -579,6 +580,20 @@ int multifd_send_sync_main(QEMUFile *f)
 return -1;
 }
 }
+
+/*
+ * When using zero-copy, it's necessary to flush the pages before any of
+ * the pages can be sent again, so we'll make sure the new version of the
+ * pages will always arrive _later_ than the old pages.
+ *
+ * Currently we achieve this by flushing the zero-page requested writes
+ * per ram iteration, but in the future we could potentially optimize it
+ * to be less frequent, e.g. only after we finished one whole scanning of
+ * all the dirty bitmaps.
+ */
+
+flush_zero_copy = migrate_use_zero_copy_send();
+
 for (i = 0; i < migrate_multifd_channels(); i++) {
 MultiFDSendParams *p = &multifd_send_state->params[i];
 
@@ -600,6 +615,17 @@ int multifd_send_sync_main(QEMUFile *f)
 ram_counters.transferred += p->packet_len;
 qemu_mutex_unlock(&p->mutex);
 qemu_sem_post(&p->sem);
+
+if (flush_zero_copy && p->c) {
+int ret;
+Error *err = NULL;
+
+ret = qio_channel_flush(p->c, &err);
+if (ret < 0) {
+error_report_err(err);
+return -1;
+}
+}
 }
 for (i = 0; i < migrate_multifd_channels(); i++) {
 MultiFDSendParams *p = &multifd_send_state->params[i];
@@ -685,8 +711,8 @@ static void *multifd_send_thread(void *opaque)
 p->iov[0].iov_base = p->packet;
 }
 
-ret = qio_channel_writev_all(p->c, p->iov, p->iovs_num,
- &local_err);
+ret = qio_channel_writev_full_all(p->c, p->iov, p->iovs_num, NULL,
+  0, p->write_flags, &local_err);
 if (ret != 0) {
 break;
 }
@@ -914,6 +940,13 @@ int multifd_save_setup(Error **errp)
 /* We need one extra place for the packet header */
 p->iov = g_new0(struct iovec, page_count + 1);
 p->normal = g_new0(ram_addr_t, page_count);
+
+if (migrate_use_zero_copy_send()) {
+p->write_flags = QIO_C

[PATCH v10 5/7] multifd: multifd_send_sync_main now returns negative on error

2022-04-26 Thread Leonardo Bras
Even though multifd_send_sync_main() currently emits error_reports, its
callers don't really check it before continuing.

Change multifd_send_sync_main() to return -1 on error and 0 on success.
Also change all its callers to make use of this change and possibly fail
earlier.

(This change is important for the next patch on the multifd zero copy
implementation, to make sure an error in the zero-copy flush does not go
unnoticed.)

Signed-off-by: Leonardo Bras 
Reviewed-by: Daniel P. Berrangé 
Reviewed-by: Peter Xu 
---
 migration/multifd.h |  2 +-
 migration/multifd.c | 10 ++
 migration/ram.c | 29 ++---
 3 files changed, 29 insertions(+), 12 deletions(-)

diff --git a/migration/multifd.h b/migration/multifd.h
index 7d0effcb03..bcf5992945 100644
--- a/migration/multifd.h
+++ b/migration/multifd.h
@@ -20,7 +20,7 @@ int multifd_load_cleanup(Error **errp);
 bool multifd_recv_all_channels_created(void);
 bool multifd_recv_new_channel(QIOChannel *ioc, Error **errp);
 void multifd_recv_sync_main(void);
-void multifd_send_sync_main(QEMUFile *f);
+int multifd_send_sync_main(QEMUFile *f);
 int multifd_queue_page(QEMUFile *f, RAMBlock *block, ram_addr_t offset);
 
 /* Multifd Compression flags */
diff --git a/migration/multifd.c b/migration/multifd.c
index 2a8c8570c3..15fb668e64 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -566,17 +566,17 @@ void multifd_save_cleanup(void)
 multifd_send_state = NULL;
 }
 
-void multifd_send_sync_main(QEMUFile *f)
+int multifd_send_sync_main(QEMUFile *f)
 {
 int i;
 
 if (!migrate_use_multifd()) {
-return;
+return 0;
 }
 if (multifd_send_state->pages->num) {
 if (multifd_send_pages(f) < 0) {
 error_report("%s: multifd_send_pages fail", __func__);
-return;
+return -1;
 }
 }
 for (i = 0; i < migrate_multifd_channels(); i++) {
@@ -589,7 +589,7 @@ void multifd_send_sync_main(QEMUFile *f)
 if (p->quit) {
 error_report("%s: channel %d has already quit", __func__, i);
 qemu_mutex_unlock(&p->mutex);
-return;
+return -1;
 }
 
 p->packet_num = multifd_send_state->packet_num++;
@@ -608,6 +608,8 @@ void multifd_send_sync_main(QEMUFile *f)
 qemu_sem_wait(&p->sem_sync);
 }
 trace_multifd_send_sync_main(multifd_send_state->packet_num);
+
+return 0;
 }
 
 static void *multifd_send_thread(void *opaque)
diff --git a/migration/ram.c b/migration/ram.c
index a2489a2699..5f5e37f64d 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -2909,6 +2909,7 @@ static int ram_save_setup(QEMUFile *f, void *opaque)
 {
 RAMState **rsp = opaque;
 RAMBlock *block;
+int ret;
 
 if (compress_threads_save_setup()) {
 return -1;
@@ -2943,7 +2944,11 @@ static int ram_save_setup(QEMUFile *f, void *opaque)
 ram_control_before_iterate(f, RAM_CONTROL_SETUP);
 ram_control_after_iterate(f, RAM_CONTROL_SETUP);
 
-multifd_send_sync_main(f);
+ret =  multifd_send_sync_main(f);
+if (ret < 0) {
+return ret;
+}
+
 qemu_put_be64(f, RAM_SAVE_FLAG_EOS);
 qemu_fflush(f);
 
@@ -3052,7 +3057,11 @@ static int ram_save_iterate(QEMUFile *f, void *opaque)
 out:
 if (ret >= 0
 && migration_is_setup_or_active(migrate_get_current()->state)) {
-multifd_send_sync_main(rs->f);
+ret = multifd_send_sync_main(rs->f);
+if (ret < 0) {
+return ret;
+}
+
 qemu_put_be64(f, RAM_SAVE_FLAG_EOS);
 qemu_fflush(f);
 ram_transferred_add(8);
@@ -3112,13 +3121,19 @@ static int ram_save_complete(QEMUFile *f, void *opaque)
 ram_control_after_iterate(f, RAM_CONTROL_FINISH);
 }
 
-if (ret >= 0) {
-multifd_send_sync_main(rs->f);
-qemu_put_be64(f, RAM_SAVE_FLAG_EOS);
-qemu_fflush(f);
+if (ret < 0) {
+return ret;
 }
 
-return ret;
+ret = multifd_send_sync_main(rs->f);
+if (ret < 0) {
+return ret;
+}
+
+qemu_put_be64(f, RAM_SAVE_FLAG_EOS);
+qemu_fflush(f);
+
+return 0;
 }
 
 static void ram_save_pending(QEMUFile *f, void *opaque, uint64_t max_size,
-- 
2.36.0




[PATCH v10 2/7] QIOChannelSocket: Implement io_writev zero copy flag & io_flush for CONFIG_LINUX

2022-04-26 Thread Leonardo Bras
For CONFIG_LINUX, implement the new zero copy flag and the optional
callback io_flush on QIOChannelSocket, enabling it only when the
MSG_ZEROCOPY feature is available in the host kernel, which is checked
in qio_channel_socket_connect_sync().

qio_channel_socket_flush() was implemented by counting how many times
sendmsg(...,MSG_ZEROCOPY) was successfully called, and then reading the
socket's error queue, in order to find how many of them finished sending.
Flush will loop until those counters are the same, or until some error occurs.
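
For reference, a minimal sketch of the completion protocol being used
here (per the kernel's MSG_ZEROCOPY documentation; needs
<linux/errqueue.h>, error handling elided):

    struct msghdr msg = {};
    char control[CMSG_SPACE(sizeof(struct sock_extended_err))];
    msg.msg_control = control;
    msg.msg_controllen = sizeof(control);
    if (recvmsg(fd, &msg, MSG_ERRQUEUE) >= 0) {
        struct cmsghdr *cm = CMSG_FIRSTHDR(&msg);
        struct sock_extended_err *serr =
            (struct sock_extended_err *)CMSG_DATA(cm);
        if (serr->ee_errno == 0 &&
            serr->ee_origin == SO_EE_ORIGIN_ZEROCOPY) {
            /* sends numbered [ee_info, ee_data] have completed */
        }
    }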

Notes on using writev() with QIO_CHANNEL_WRITE_FLAG_ZERO_COPY:
1: Buffer
- As MSG_ZEROCOPY tells the kernel to use the same user buffer to avoid
copying, some caution is necessary to avoid overwriting any buffer before
it's sent. If something like this happens, a newer version of the buffer
may be sent instead.
- If this is a problem, it's recommended to call qio_channel_flush()
before freeing or re-using the buffer.

2: Locked memory
- When using MSG_ZEROCOPY, the buffer memory will be locked after being
queued, and unlocked after it's sent.
- Depending on the size of each buffer, and how often it's sent, it may
require a larger amount of locked memory than is usually available to a
non-root user.
- If the required amount of locked memory is not available,
writev_zero_copy will return an error, which can abort an operation like
migration.
- Because of this, when user code wants to add zero copy as a feature, it
requires a mechanism to disable it, so it can still be accessible to less
privileged users.
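
As a sketch of that last point, a caller could probe the locked-memory
limit up front and keep the feature disabled when it looks insufficient
(hypothetical policy; expected_queued_bytes is an illustrative estimate,
getrlimit() is from <sys/resource.h>):

    struct rlimit rl;
    if (getrlimit(RLIMIT_MEMLOCK, &rl) == 0 &&
        rl.rlim_cur != RLIM_INFINITY &&
        rl.rlim_cur < expected_queued_bytes) {
        /* leave zero-copy send disabled for this migration */
    }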

Signed-off-by: Leonardo Bras 
Reviewed-by: Peter Xu 
Reviewed-by: Daniel P. Berrangé 
Reviewed-by: Juan Quintela 
---
 include/io/channel-socket.h |   2 +
 io/channel-socket.c | 108 ++--
 2 files changed, 106 insertions(+), 4 deletions(-)

diff --git a/include/io/channel-socket.h b/include/io/channel-socket.h
index e747e63514..513c428fe4 100644
--- a/include/io/channel-socket.h
+++ b/include/io/channel-socket.h
@@ -47,6 +47,8 @@ struct QIOChannelSocket {
 socklen_t localAddrLen;
 struct sockaddr_storage remoteAddr;
 socklen_t remoteAddrLen;
+ssize_t zero_copy_queued;
+ssize_t zero_copy_sent;
 };
 
 
diff --git a/io/channel-socket.c b/io/channel-socket.c
index 696a04dc9c..1dd85fc1ef 100644
--- a/io/channel-socket.c
+++ b/io/channel-socket.c
@@ -25,6 +25,10 @@
 #include "io/channel-watch.h"
 #include "trace.h"
 #include "qapi/clone-visitor.h"
+#ifdef CONFIG_LINUX
+#include 
+#include 
+#endif
 
 #define SOCKET_MAX_FDS 16
 
@@ -54,6 +58,8 @@ qio_channel_socket_new(void)
 
 sioc = QIO_CHANNEL_SOCKET(object_new(TYPE_QIO_CHANNEL_SOCKET));
 sioc->fd = -1;
+sioc->zero_copy_queued = 0;
+sioc->zero_copy_sent = 0;
 
 ioc = QIO_CHANNEL(sioc);
 qio_channel_set_feature(ioc, QIO_CHANNEL_FEATURE_SHUTDOWN);
@@ -153,6 +159,16 @@ int qio_channel_socket_connect_sync(QIOChannelSocket *ioc,
 return -1;
 }
 
+#ifdef CONFIG_LINUX
+int ret, v = 1;
+ret = setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &v, sizeof(v));
+if (ret == 0) {
+/* Zero copy available on host */
+qio_channel_set_feature(QIO_CHANNEL(ioc),
+QIO_CHANNEL_FEATURE_WRITE_ZERO_COPY);
+}
+#endif
+
 return 0;
 }
 
@@ -533,6 +549,7 @@ static ssize_t qio_channel_socket_writev(QIOChannel *ioc,
 char control[CMSG_SPACE(sizeof(int) * SOCKET_MAX_FDS)];
 size_t fdsize = sizeof(int) * nfds;
 struct cmsghdr *cmsg;
+int sflags = 0;
 
 memset(control, 0, CMSG_SPACE(sizeof(int) * SOCKET_MAX_FDS));
 
@@ -557,15 +574,27 @@ static ssize_t qio_channel_socket_writev(QIOChannel *ioc,
 memcpy(CMSG_DATA(cmsg), fds, fdsize);
 }
 
+if (flags & QIO_CHANNEL_WRITE_FLAG_ZERO_COPY) {
+sflags = MSG_ZEROCOPY;
+}
+
  retry:
-ret = sendmsg(sioc->fd, &msg, 0);
+ret = sendmsg(sioc->fd, &msg, sflags);
 if (ret <= 0) {
-if (errno == EAGAIN) {
+switch (errno) {
+case EAGAIN:
 return QIO_CHANNEL_ERR_BLOCK;
-}
-if (errno == EINTR) {
+case EINTR:
 goto retry;
+case ENOBUFS:
+if (sflags & MSG_ZEROCOPY) {
+error_setg_errno(errp, errno,
+ "Process can't lock enough memory for using 
MSG_ZEROCOPY");
+return -1;
+}
+break;
 }
+
 error_setg_errno(errp, errno,
  "Unable to write to socket");
 return -1;
@@ -659,6 +688,74 @@ static ssize_t qio_channel_socket_writev(QIOChannel *ioc,
 }
 #endif /* WIN32 */
 
+
+#ifdef CONFIG_LINUX
+static int qio_channel_socket_flush(QIOChannel *ioc,
+Error **errp)
+{
+QIOChannelSocket *sioc = QIO_CHANNEL_SOCKET(ioc);
+struct msghdr msg = {};
+struct sock_extended_err *serr;
+struct cmsghdr *cm;
+char control[CMSG_SPACE(sizeof(*serr))];
+int received;
+int ret = 1;

[PATCH v10 1/7] QIOChannel: Add flags on io_writev and introduce io_flush callback

2022-04-26 Thread Leonardo Bras
Add flags to io_writev and introduce io_flush as an optional callback to
QIOChannelClass, allowing the implementation of zero copy writes by
subclasses.

How to use them (a minimal caller sketch follows the list):
- Write data using qio_channel_writev*(...,QIO_CHANNEL_WRITE_FLAG_ZERO_COPY),
- Wait write completion with qio_channel_flush().
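
Putting the two steps together, a minimal caller sketch (error handling
elided; assumes the channel's class implements io_flush):

    qio_channel_writev_full_all(ioc, iov, niov, NULL, 0,
                                QIO_CHANNEL_WRITE_FLAG_ZERO_COPY, errp);
    /* ... more zero-copy writes may be queued here ... */
    qio_channel_flush(ioc, errp);  /* block until queued sends complete */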

Notes:
As some zero copy write implementations work asynchronously, it's
recommended to keep the write buffer untouched until the return of
qio_channel_flush(), to avoid the risk of sending an updated buffer
instead of the buffer's state at the time of the write.

As the io_flush callback is optional, if a subclass does not implement it, then:
- io_flush will return 0 without changing anything.

Also, some functions like qio_channel_writev_full_all() were adapted to
receive a flags parameter. That allows code to be shared between zero
copy and non-zero copy writev, and also makes implementing new flags
easier.

Signed-off-by: Leonardo Bras 
Reviewed-by: Daniel P. Berrangé 
Reviewed-by: Peter Xu 
Reviewed-by: Juan Quintela 
---
 include/io/channel.h| 38 +-
 chardev/char-io.c   |  2 +-
 hw/remote/mpqemu-link.c |  2 +-
 io/channel-buffer.c |  1 +
 io/channel-command.c|  1 +
 io/channel-file.c   |  1 +
 io/channel-socket.c |  2 ++
 io/channel-tls.c|  1 +
 io/channel-websock.c|  1 +
 io/channel.c| 49 +++--
 migration/rdma.c|  1 +
 scsi/pr-manager-helper.c|  2 +-
 tests/unit/test-io-channel-socket.c |  1 +
 13 files changed, 88 insertions(+), 14 deletions(-)

diff --git a/include/io/channel.h b/include/io/channel.h
index 88988979f8..c680ee7480 100644
--- a/include/io/channel.h
+++ b/include/io/channel.h
@@ -32,12 +32,15 @@ OBJECT_DECLARE_TYPE(QIOChannel, QIOChannelClass,
 
 #define QIO_CHANNEL_ERR_BLOCK -2
 
+#define QIO_CHANNEL_WRITE_FLAG_ZERO_COPY 0x1
+
 typedef enum QIOChannelFeature QIOChannelFeature;
 
 enum QIOChannelFeature {
 QIO_CHANNEL_FEATURE_FD_PASS,
 QIO_CHANNEL_FEATURE_SHUTDOWN,
 QIO_CHANNEL_FEATURE_LISTEN,
+QIO_CHANNEL_FEATURE_WRITE_ZERO_COPY,
 };
 
 
@@ -104,6 +107,7 @@ struct QIOChannelClass {
  size_t niov,
  int *fds,
  size_t nfds,
+ int flags,
  Error **errp);
 ssize_t (*io_readv)(QIOChannel *ioc,
 const struct iovec *iov,
@@ -136,6 +140,8 @@ struct QIOChannelClass {
   IOHandler *io_read,
   IOHandler *io_write,
   void *opaque);
+int (*io_flush)(QIOChannel *ioc,
+Error **errp);
 };
 
 /* General I/O handling functions */
@@ -228,6 +234,7 @@ ssize_t qio_channel_readv_full(QIOChannel *ioc,
  * @niov: the length of the @iov array
  * @fds: an array of file handles to send
  * @nfds: number of file handles in @fds
+ * @flags: write flags (QIO_CHANNEL_WRITE_FLAG_*)
  * @errp: pointer to a NULL-initialized error object
  *
  * Write data to the IO channel, reading it from the
@@ -260,6 +267,7 @@ ssize_t qio_channel_writev_full(QIOChannel *ioc,
 size_t niov,
 int *fds,
 size_t nfds,
+int flags,
 Error **errp);
 
 /**
@@ -837,6 +845,7 @@ int qio_channel_readv_full_all(QIOChannel *ioc,
  * @niov: the length of the @iov array
  * @fds: an array of file handles to send
  * @nfds: number of file handles in @fds
+ * @flags: write flags (QIO_CHANNEL_WRITE_FLAG_*)
  * @errp: pointer to a NULL-initialized error object
  *
  *
@@ -846,6 +855,14 @@ int qio_channel_readv_full_all(QIOChannel *ioc,
  * to be written, yielding from the current coroutine
  * if required.
  *
+ * If QIO_CHANNEL_WRITE_FLAG_ZERO_COPY is passed in flags,
+ * instead of waiting for all requested data to be written,
+ * this function will wait until it's all queued for writing.
+ * In this case, if the buffer gets changed between queueing and
+ * sending, the updated buffer will be sent. If this is not a
+ * desired behavior, it's suggested to call qio_channel_flush()
+ * before reusing the buffer.
+ *
  * Returns: 0 if all bytes were written, or -1 on error
  */
 
@@ -853,6 +870,25 @@ int qio_channel_writev_full_all(QIOChannel *ioc,
 const struct iovec *iov,
 size_t niov,
 int *fds, size_t nfds,
-Error **errp);
+int flags, Error **errp);
+
+/**
+ * qio_channel_flush:
+ * @ioc: the channel object
+ * @errp: pointer to a NULL-initialized error object
+ *
+ * Will block until every packet queued with
+ * qio_channel_writev_full() + QIO_

[PATCH v10 0/7] MSG_ZEROCOPY + multifd

2022-04-26 Thread Leonardo Bras
This patch series intends to enable MSG_ZEROCOPY in QIOChannel, and make
use of it to improve multifd migration performance by reducing CPU
usage.

Patch #1 creates new callbacks for QIOChannel, allowing the implementation
of zero copy writing.

Patch #2 implements io_writev flags and io_flush() on QIOChannelSocket,
making use of MSG_ZEROCOPY on Linux.

Patch #3 adds a "zero_copy_send" migration property, only available with
CONFIG_LINUX, and compiled-out in any other architectures.
This migration property has to be enabled before multifd migration starts.

Patch #4 adds a helper function that allows to see if TLS is going to be used.
This helper will be later used in patch #5.

Patch #5 changes multifd_send_sync_main() so it returns int instead of void.
The return value is used to understand if any error happened in the function,
allowing migration to possible fail earlier.

Patch #6 implements a workaround: The behavior introduced in d48c3a0445
is hard to deal with in zerocopy, so a workaround is introduced to send
the header in a separate syscall, without MSG_ZEROCOPY.

Patch #7 Makes use of QIOChannelSocket zero_copy implementation on
nocomp multifd migration.

Results:
In preliminary tests, the resource usage of __sys_sendmsg() was reduced
15-fold, and the overall migration took 13-22% less time, based on a
synthetic CPU workload.

In further tests, it was noted that, on multifd migration with 8 channels:
- On idle hosts, migration time was reduced by 10% to 21%.
- On hosts busy with heavy cpu stress (1 stress thread per cpu, but
  not cpu-pinned), migration time was reduced by ~25% by enabling zero-copy.
- On hosts with heavy cpu-pinned workloads (1 stress thread per cpu,
  cpu-pinned), migration time was reduced by ~66% by enabling zero-copy.

Above tests setup:
- Sending and Receiving hosts:
  - CPU : Intel(R) Xeon(R) Platinum 8276L CPU @ 2.20GHz (448 CPUS)
  - Network card: E810-C (100Gbps)
  - >1TB RAM
  - QEMU: Upstream master branch + This patchset
  - Linux: Upstream v5.15 
- VM configuration:
  - 28 VCPUs
  - 512GB RAM


---
Changes since v9:
- Patch #6 got simplified and improved (thanks Daniel)
- Patch #7 got better comments (thanks Peter Xu)

Changes since v8:
- Inserted two new patches #5 & #6, previous patch #5 is now #7.
- Workaround an optimization introduced in d48c3a0445
- Removed unnecessary assert in qio_channel_writev_full_all

Changes since v7:
- Migration property renamed from zero-copy to zero-copy-send
- A few early tests added to help misconfigurations to fail earlier
- qio_channel_full*_flags() renamed back to qio_channel_full*()
- multifd_send_sync_main() reverted back to not receiving a flag,
  so it always syncs zero-copy when enabled.
- Improve code quality on a few points

Changes since v6:
- Remove io_writev_zero_copy(), and makes use of io_writev() new flags
  to achieve the same results.
- Rename io_flush_zero_copy() to io_flush()
- Previous patch #2 became too small, so it was squashed in previous
  patch #3 (now patch #2)

Changes since v5:
- flush_zero_copy now returns -1 on fail, 0 on success, and 1 when all
  processed writes were not able to use zerocopy in kernel.
- qio_channel_socket_poll() removed, using qio_channel_wait() instead
- ENOBUFS is now processed inside qio_channel_socket_writev_flags()
- Most zerocopy parameter validation moved to migrate_params_check(),
  leaving only feature test to socket_outgoing_migration() callback
- Naming went from *zerocopy to *zero_copy or *zero-copy, due to QAPI/QMP
  preferences
- Improved docs

Changes since v4:
- 3 patches got splitted in 6
- Flush is used for syncing after each iteration, instead of only at the end
- If zerocopy is not available, fail in connect instead of failing on write
- 'multifd-zerocopy' property renamed to 'zerocopy'
- Fail migrations that don't support zerocopy, if it's enabled.
- Instead of checking for zerocopy at each write, save the flags in
  MultiFDSendParams->write_flags and use them on write
- Reorganized flag usage in QIOChannelSocket 
- A lot of typos fixed
- More doc on buffer restrictions

Changes since v3:
- QIOChannel interface names changed from io_async_{writev,flush} to
  io_{writev,flush}_zerocopy
- Instead of falling back in case zerocopy is not implemented, return
  error and abort operation.
- Flush now waits as long as needed, or return error in case anything
  goes wrong, aborting the operation.
- Zerocopy is now conditional in multifd, being set by parameter
  multifd-zerocopy
- Moves zerocopy_flush to multifd_send_sync_main() from multifd_save_cleanup
  so migration can abort if flush goes wrong.
- Several other small improvements

Changes since v2:
- Patch #1: One more fallback
- Patch #2: Fall back to sync if fails to lock buffer memory in MSG_ZEROCOPY 
send.

Changes since v1:
- Reimplemented the patchset using async_write + async_flush approach.
- Implemented a flush to be able to tell whenever all data was written.

Leonardo Bras (7):
  QIOChannel: Add flags on io_write

Re: [RFC PATCH 1/7] target/ppc: Implement xxm[tf]acc and xxsetaccz

2022-04-26 Thread Richard Henderson

On 4/26/22 05:50, Lucas Mateus Castro(alqotel) wrote:

From: "Lucas Mateus Castro (alqotel)"

Implement the following PowerISA v3.1 instructions:
xxmfacc: VSX Move From Accumulator
xxmtacc: VSX Move To Accumulator
xxsetaccz: VSX Set Accumulator to Zero

The PowerISA 3.1 mentions that for the current version of the
architecture, "the hardware implementation provides the effect of ACC[i]
and VSRs 4*i to 4*i + 3 logically containing the same data" and "The
Accumulators introduce no new logical state at this time" (page 501).
For now it seems unnecessary to create new structures, so this patch
just uses ACC[i] as VSRs 4*i to 4*i+3 and therefore move to and from
accumulators are no-ops.

Signed-off-by: Lucas Mateus Castro (alqotel)
---
  target/ppc/insn32.decode|  9 
  target/ppc/translate/vsx-impl.c.inc | 36 +
  2 files changed, 45 insertions(+)


Reviewed-by: Richard Henderson 



+TCGv_i64 zero = tcg_constant_i64(0);
+for (i = 0; i < 4; i++) {
+set_cpu_vsr(a->ra * 4 + i, zero, false);
+set_cpu_vsr(a->ra * 4 + i, zero, true);
+}


or

  tcg_gen_gvec_dup_imm(MO_64, acc_full_offset(a->ra), 64, 64, 0);


r~



Re: [PATCH v9 7/7] multifd: Implement zero copy write in multifd migration (multifd-zero-copy)

2022-04-26 Thread Leonardo Bras Soares Passos
Hello Peter, thanks for helping!

On Tue, Apr 26, 2022 at 1:02 PM Peter Xu  wrote:
>
> Leo,
>
> This patch looks mostly good to me, a few nitpicks below.
>
> On Mon, Apr 25, 2022 at 06:50:56PM -0300, Leonardo Bras wrote:
[...]
> >  }
> > +
> > +/*
> > + * When using zero-copy, it's necessary to flush after each iteration 
> > to
> > + * make sure pages from earlier iterations don't end up replacing newer
> > + * pages.
> > + */
> > +flush_zero_copy = migrate_use_zero_copy_send();
>
> Would you mind inline it if it's only used once?

It's not obvious in the diff, but this is used in a loop below, so I
inserted the variable to avoid calling migrate_use_zero_copy_send() for
each multifd channel.

>
> It's great to have that comment, but IMHO it could be more explicit, even
> marking a TODO showing that maybe we could do better in the future:
>
>   /*
>* When using zero-copy, it's necessary to flush the pages before any of
>* the pages can be sent again, so we'll make sure the new version of the
>* pages will always arrive _later_ than the old pages.
>*
>* Currently we achieve this by flushing the zero-page requested writes
>* per ram iteration, but in the future we could potentially optimize it
>* to be less frequent, e.g. only after we finished one whole scanning of
>* all the dirty bitmaps.
>*/
>

Thanks! I will insert that in the next version.

The thing here is that I was under the impression that an iteration was
equivalent to a whole scan of all the dirty bitmaps. I see now that it
may not be the case.

[...]
> > @@ -688,10 +708,9 @@ static void *multifd_send_thread(void *opaque)
> >  p->iov[0].iov_base = p->packet;
> >  }
> >
> > -ret = qio_channel_writev_all(p->c, p->iov + iov_offset,
> > - p->iovs_num - iov_offset,
> > - &local_err);
> > -
> > +ret = qio_channel_writev_full_all(p->c, p->iov + iov_offset,
> > +  p->iovs_num - iov_offset, 
> > NULL,
> > +  0, p->write_flags, 
> > &local_err);
>
> I kind of agree with Dan in previous patch - this iov_offset is confusing,
> better drop it.

Sure, fixed for v10.

>
[...]
> --
> Peter Xu
>

Best regards,
Leo




Re: [PATCH v9 6/7] multifd: Send header packet without flags if zero-copy-send is enabled

2022-04-26 Thread Leonardo Bras Soares Passos
Hello Daniel, thank you for the feedback!

On Tue, Apr 26, 2022 at 5:11 AM Daniel P. Berrangé  wrote:
>
> On Mon, Apr 25, 2022 at 06:50:55PM -0300, Leonardo Bras wrote:
> > Since d48c3a0445 ("multifd: Use a single writev on the send side"),
> > sending the header packet and the memory pages happens in the same
> > writev, which can potentially make the migration faster.
> >
> > Using channel-socket as example, this works well with the default copying
> > mechanism of sendmsg(), but with zero-copy-send=true, it will cause
> > the migration to often break.
> >
> > This happens because the header packet buffer gets reused quite often,
> > and there is a high chance that by the time the MSG_ZEROCOPY mechanism get
> > to send the buffer, it has already changed, sending the wrong data and
> > causing the migration to abort.
> >
> > It means that, as it is, the buffer for the header packet is not suitable
> > for sending with MSG_ZEROCOPY.
> >
> > In order to enable zero copy for multifd, send the header packet on an
> > individual write(), without any flags, and the remanining pages with a
> > writev(), as it was happening before. This only changes how a migration
> > with zero-copy-send=true works, not changing any current behavior for
> > migrations with zero-copy-send=false.
> >
> > Signed-off-by: Leonardo Bras 
> > ---
> >  migration/multifd.c | 29 ++---
> >  1 file changed, 26 insertions(+), 3 deletions(-)
> >
> > diff --git a/migration/multifd.c b/migration/multifd.c
> > index 15fb668e64..6c940aaa98 100644
> > --- a/migration/multifd.c
> > +++ b/migration/multifd.c
> > @@ -639,6 +639,8 @@ static void *multifd_send_thread(void *opaque)
> >  if (p->pending_job) {
> >  uint64_t packet_num = p->packet_num;
> >  uint32_t flags = p->flags;
> > +int iov_offset = 0;
> > +
>
> No need for this if you change:
>
> >  p->iovs_num = 1;
>
>if (!migrate_use_zero_copy_send()) {
>   p->iovs_num = 1;
>}
>

I understand the point now: setting p->iovs_num = 0 before
multifd_send_state->ops->send_prepare() causes p->iov[0] to be used for
pages instead of the header. I was not aware, so thanks for pointing that out!

But it's also necessary to have an else clause with p->iovs_num = 0, right?
It seems like the variable is not set anywhere else, so it would keep growing
from the second loop iteration onwards, causing prepare() to access p->iov[]
out of bounds.

Am I missing something here?
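
For clarity, the sketch I have in mind on top of your suggestion:

    if (!migrate_use_zero_copy_send()) {
        /* Reserve iov[0] for the packet header */
        p->iovs_num = 1;
    } else {
        /* Header is sent separately; the pages start at iov[0] */
        p->iovs_num = 0;
    }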

>
> >  p->normal_num = 0;
> >
> > @@ -665,15 +667,36 @@ static void *multifd_send_thread(void *opaque)
> >  trace_multifd_send(p->id, packet_num, p->normal_num, flags,
> > p->next_packet_size);
> >
> > -p->iov[0].iov_len = p->packet_len;
> > -p->iov[0].iov_base = p->packet;
> > +if (migrate_use_zero_copy_send()) {
> > +/* Send header without zerocopy */
> > +ret = qio_channel_write_all(p->c, (void *)p->packet,
> > +p->packet_len, &local_err);
> > +if (ret != 0) {
> > +break;
> > +}
> > +
> > +if (!p->normal_num) {
> > +/* No pages will be sent */
> > +goto skip_send;
> > +}
>
> Don't need this AFAIK, because the qio_channel_writev_all
> call will be a no-op if  iovs_num is zero
>

Oh, I see:
qio_channel_writev_all() will call qio_channel_writev_full_all() where
niov == 0 and thus nlocal_iov == 0, avoiding the loop that calls
qio_channel_writev_full().
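
For reference, a rough sketch of qio_channel_writev_full_all() as I read it
(error handling and bookkeeping elided):

    size_t nlocal_iov = niov;
    ...
    while (nlocal_iov > 0) {    /* never entered when niov == 0 */
        ssize_t len = qio_channel_writev_full(ioc, local_iov, nlocal_iov,
                                              fds, nfds, flags, errp);
        ...
    }
    return 0;                   /* an empty iov list is a successful no-op */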

I will remove that in v10


> >
> > -ret = qio_channel_writev_all(p->c, p->iov, p->iovs_num,
> > +/* Skip first iov : header */
> > +iov_offset = 1;
>
> Don't need to set this

Agree, that makes sense since the offset part is discontinued.

>
> > +} else {
> > +/* Send header using the same writev call */
> > +p->iov[0].iov_len = p->packet_len;
> > +p->iov[0].iov_base = p->packet;
> > +}
> > +
> > +ret = qio_channel_writev_all(p->c, p->iov + iov_offset,
> > + p->iovs_num - iov_offset,
> >   &local_err);
>
> This wouldn't need changing if we don't reserve iovs[0] when
> not required.

Agree.

>
> > +
> >  if (ret != 0) {
> >  break;
> >  }
> >
> > +skip_send:
> >  qemu_mutex_lock(&p->mutex);
> >  p->pending_job--;
> >  qemu_mutex_unlock(&p->mutex);
> > --
> > 2.36.0
> >
>
> With regards,
> Daniel
> --
> |: https://berrange.com  -o-https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org -o-https://fstop138.berrange.com :|
> |: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|
>


Re: [PULL 00/68] nios2 patch queue

2022-04-26 Thread Richard Henderson

On 4/26/22 11:17, Richard Henderson wrote:

The following changes since commit a1755db71e34df016ffc10aa0727360aae2c6036:

   Merge tag 'pull-block-2022-04-25' of https://gitlab.com/hreitz/qemu into 
staging (2022-04-25 13:35:41 -0700)

are available in the Git repository at:

   https://gitlab.com/rth7680/qemu.git tags/pull-nios2-20220426

for you to fetch changes up to 7f176c5a0bcb70492f3b158a36311e75f1eb87d7:

   tests/tcg/nios2: Add test-shadow-1 (2022-04-26 08:17:10 -0700)


Fix nios2-linux-user syscalls.
Fix nios2-linux-user sigreturn.
Enable tests for nios2-linux-user.
Remove special handling of SIGSEGV.
Check supervisor for eret, bret.
Split special registers out of env->regs[].
Clean up interrupt processing.
Raise unaligned data and destination exceptions.
Set TLBMISC fields correctly on exceptions.
Prevent writes to read-only or reserved control fields.
Use tcg_constant_tl().
Implement shadow register sets.
Implement external interrupt controller interface.
Implement vectored interrupt controller.
Enable semihosting tests for nios2-softmmu.


Applied, thanks.  Please update https://wiki.qemu.org/ChangeLog/7.1 as 
appropriate.


r~





Amir Gonnen (5):
   target/nios2: Check supervisor on eret
   target/nios2: Add NUM_GP_REGS and NUM_CP_REGS
   target/nios2: Split out helper for eret instruction
   hw/intc: Vectored Interrupt Controller (VIC)
   hw/nios2: Machine with a Vectored Interrupt Controller

Richard Henderson (63):
   linux-user/nios2: Hoist pc advance to the top of EXCP_TRAP
   linux-user/nios2: Fix clone child return
   linux-user/nios2: Drop syscall 0 "workaround"
   linux-user/nios2: Adjust error return
   linux-user/nios2: Handle special qemu syscall return values
   linux-user/nios2: Remove do_sigreturn
   linux-user/nios2: Use QEMU_ESIGRETURN from do_rt_sigreturn
   tests/tcg/nios2: Re-enable linux-user tests
   target/nios2: Remove user-only nios2_cpu_do_interrupt
   target/nios2: Remove nios2_cpu_record_sigsegv
   target/nios2: Build helper.c for system only
   linux-user/nios2: Use force_sig_fault for EXCP_DEBUG
   target/nios2: Stop generating code if gen_check_supervisor fails
   target/nios2: Split PC out of env->regs[]
   target/nios2: Fix BRET instruction
   target/nios2: Do not create TCGv for control registers
   linux-user/nios2: Only initialize SP and PC in target_cpu_copy_regs
   target/nios2: Remove cpu_interrupts_enabled
   target/nios2: Split control registers away from general registers
   target/nios2: Clean up nios2_cpu_dump_state
   target/nios2: Use hw/registerfields.h for CR_STATUS fields
   target/nios2: Use hw/registerfields.h for CR_EXCEPTION fields
   target/nios2: Use hw/registerfields.h for CR_TLBADDR fields
   target/nios2: Use hw/registerfields.h for CR_TLBACC fields
   target/nios2: Rename CR_TLBMISC_WR to CR_TLBMISC_WE
   target/nios2: Use hw/registerfields.h for CR_TLBMISC fields
   target/nios2: Move R_FOO and CR_BAR into enumerations
   target/nios2: Create EXCP_SEMIHOST for semi-hosting
   target/nios2: Clean up nios2_cpu_do_interrupt
   target/nios2: Hoist CPU_LOG_INT logging
   target/nios2: Handle EXCP_UNALIGN and EXCP_UNALIGND
   target/nios2: Cleanup set of CR_EXCEPTION for do_interrupt
   target/nios2: Clean up handling of tlbmisc in do_exception
   target/nios2: Prevent writes to read-only or reserved control fields
   target/nios2: Implement cpuid
   target/nios2: Implement CR_STATUS.RSIE
   target/nios2: Remove CPU_INTERRUPT_NMI
   target/nios2: Support division error exception
   target/nios2: Use tcg_constant_tl
   target/nios2: Split out named structs for [IRJ]_TYPE
   target/nios2: Split out helpers for gen_i_cmpxx
   target/nios2: Split out helpers for gen_i_math_logic
   target/nios2: Split out helpers for gen_r_math_logic
   target/nios2: Split out helpers for gen_rr_mul_high
   target/nios2: Split out helpers for gen_rr_shift
   target/nios2: Introduce dest_gpr
   target/nios2: Drop CR_STATUS_EH from tb->flags
   target/nios2: Enable unaligned traps for system mode
   target/nios2: Create gen_jumpr
   target/nios2: Hoist set of is_jmp into gen_goto_tb
   target/nios2: Use gen_goto_tb for DISAS_TOO_MANY
   target/nios2: Use tcg_gen_lookup_and_goto_ptr
   target/nios2: Implement Misaligned destination exception
   target/nios2: Introduce shadow register sets
   target/nios2: Implement rdprs, wrprs
   target/nios2: Update helper_eret for shadow registers
   target/nios2: Implement EIC interrupt processing
   target/nios2: Advance pc when raising exceptions
   linux-user/nios2: Handle various SIGILL exceptions
   hw/nios2

Re: [PATCH v7 08/12] target/riscv: Add sscofpmf extension support

2022-04-26 Thread Atish Patra
On Mon, Apr 18, 2022 at 3:46 PM Alistair Francis  wrote:
>
> On Sat, Apr 16, 2022 at 9:54 AM Atish Kumar Patra  wrote:
> >
> > On Wed, Apr 13, 2022 at 12:08 AM Alistair Francis  
> > wrote:
> > >
> > > On Thu, Mar 31, 2022 at 10:19 AM Atish Patra  wrote:
> > > >
> > > > The Sscofpmf ('Ss' for Privileged arch and Supervisor-level extensions,
> > > > and 'cofpmf' for Count OverFlow and Privilege Mode Filtering)
> > > > extension allows the perf to handle overflow interrupts and filtering
> > > > support. This patch provides a framework for programmable
> > > > counters to leverage the extension. As the extension doesn't have any
> > > > provision for the overflow bit for fixed counters, the fixed events
> > > > can also be monitoring using programmable counters. The underlying
> > > > counters for cycle and instruction counters are always running. Thus,
> > > > a separate timer device is programmed to handle the overflow.
> > > >
> > > > Signed-off-by: Atish Patra 
> > > > Signed-off-by: Atish Patra 
> > > > ---
> > > >  target/riscv/cpu.c  |  11 ++
> > > >  target/riscv/cpu.h  |  25 +++
> > > >  target/riscv/cpu_bits.h |  55 +++
> > > >  target/riscv/csr.c  | 156 --
> > > >  target/riscv/pmu.c  | 347 +++-
> > > >  target/riscv/pmu.h  |   7 +
> > > >  6 files changed, 590 insertions(+), 11 deletions(-)
> > > >
> > > > diff --git a/target/riscv/cpu.c b/target/riscv/cpu.c
> > > > index f63602828680..9715eed2fc4e 100644
> > > > --- a/target/riscv/cpu.c
> > > > +++ b/target/riscv/cpu.c
> > > > @@ -22,6 +22,7 @@
> > > >  #include "qemu/ctype.h"
> > > >  #include "qemu/log.h"
> > > >  #include "cpu.h"
> > > > +#include "pmu.h"
> > > >  #include "internals.h"
> > > >  #include "exec/exec-all.h"
> > > >  #include "qapi/error.h"
> > > > @@ -696,6 +697,15 @@ static void riscv_cpu_realize(DeviceState *dev, Error **errp)
> > > >  set_misa(env, env->misa_mxl, ext);
> > > >  }
> > > >
> > > > +#ifndef CONFIG_USER_ONLY
> > > > +if (cpu->cfg.pmu_num) {
> > > > +if (!riscv_pmu_init(cpu, cpu->cfg.pmu_num) && cpu->cfg.ext_sscofpmf) {
> > > > +cpu->pmu_timer = timer_new_ns(QEMU_CLOCK_VIRTUAL,
> > > > +  riscv_pmu_timer_cb, cpu);
> > > > +}
> > > > + }
> > > > +#endif
> > > > +
> > > >  riscv_cpu_register_gdb_regs_for_features(cs);
> > > >
> > > >  qemu_init_vcpu(cs);
> > > > @@ -795,6 +805,7 @@ static Property riscv_cpu_properties[] = {
> > > >  DEFINE_PROP_BOOL("v", RISCVCPU, cfg.ext_v, false),
> > > >  DEFINE_PROP_BOOL("h", RISCVCPU, cfg.ext_h, true),
> > > >  DEFINE_PROP_UINT8("pmu-num", RISCVCPU, cfg.pmu_num, 16),
> > > > +DEFINE_PROP_BOOL("sscofpmf", RISCVCPU, cfg.ext_sscofpmf, false),
> > > >  DEFINE_PROP_BOOL("Zifencei", RISCVCPU, cfg.ext_ifencei, true),
> > > >  DEFINE_PROP_BOOL("Zicsr", RISCVCPU, cfg.ext_icsr, true),
> > > >  DEFINE_PROP_BOOL("Zfh", RISCVCPU, cfg.ext_zfh, false),
> > > > diff --git a/target/riscv/cpu.h b/target/riscv/cpu.h
> > > > index 0fa15595fb37..a0e2279ea5e6 100644
> > > > --- a/target/riscv/cpu.h
> > > > +++ b/target/riscv/cpu.h
> > > > @@ -131,6 +131,8 @@ typedef struct PMUCTRState {
> > > >  /* Snapshort value of a counter in RV32 */
> > > >  target_ulong mhpmcounterh_prev;
> > > >  bool started;
> > > > +/* Value beyond UINT32_MAX/UINT64_MAX before overflow interrupt trigger */
> > > > +target_ulong irq_overflow_left;
> > > >  } PMUCTRState;
> > > >
> > > >  struct CPUArchState {
> > > > @@ -291,6 +293,9 @@ struct CPUArchState {
> > > >  /* PMU event selector configured values. First three are unused*/
> > > >  target_ulong mhpmevent_val[RV_MAX_MHPMEVENTS];
> > > >
> > > > +/* PMU event selector configured values for RV32*/
> > > > +target_ulong mhpmeventh_val[RV_MAX_MHPMEVENTS];
> > > > +
> > > >  target_ulong sscratch;
> > > >  target_ulong mscratch;
> > > >
> > > > @@ -413,6 +418,7 @@ struct RISCVCPUConfig {
> > > >  bool ext_zhinxmin;
> > > >  bool ext_zve32f;
> > > >  bool ext_zve64f;
> > > > +bool ext_sscofpmf;
> > > >
> > > >  /* Vendor-specific custom extensions */
> > > >  bool ext_XVentanaCondOps;
> > > > @@ -452,6 +458,12 @@ struct ArchCPU {
> > > >
> > > >  /* Configuration Settings */
> > > >  RISCVCPUConfig cfg;
> > > > +
> > > > +QEMUTimer *pmu_timer;
> > > > +/* A bitmask of Available programmable counters */
> > > > +uint32_t pmu_avail_ctrs;
> > > > +/* Mapping of events to counters */
> > > > +GHashTable *pmu_event_ctr_map;
> > > >  };
> > > >
> > > >  static inline int riscv_has_ext(CPURISCVState *env, target_ulong ext)
> > > > @@ -709,6 +721,19 @@ enum {
> > > >  CSR_TABLE_SIZE = 0x1000
> > > >  };
> > > >
> > > > +/**
> > > > + * The event id are encoded based on the encoding specified in the
> > > > + * SBI specification v0.3
> > > > + */
> > > > +
> 

Re: [PATCH 03/20] target/ppc: Substitute msr_pr macro with new M_MSR_PR macro

2022-04-26 Thread Richard Henderson

On 4/22/22 11:54, Víctor Colombo wrote:

Suggested-by: Richard Henderson 
Signed-off-by: Víctor Colombo 
---
  hw/ppc/pegasos2.c        |  2 +-
  hw/ppc/spapr.c           |  2 +-
  target/ppc/cpu.h         |  3 ++-
  target/ppc/cpu_init.c    |  4 ++--
  target/ppc/excp_helper.c |  6 +++---
  target/ppc/mem_helper.c  |  4 ++--
  target/ppc/mmu-radix64.c |  4 ++--
  target/ppc/mmu_common.c  | 23 ---
  8 files changed, 25 insertions(+), 23 deletions(-)

diff --git a/hw/ppc/pegasos2.c b/hw/ppc/pegasos2.c
index 56bf203dfd..27ed54a71d 100644
--- a/hw/ppc/pegasos2.c
+++ b/hw/ppc/pegasos2.c
@@ -461,7 +461,7 @@ static void pegasos2_hypercall(PPCVirtualHypervisor *vhyp, PowerPCCPU *cpu)
  /* The TCG path should also be holding the BQL at this point */
  g_assert(qemu_mutex_iothread_locked());
  
-if (msr_pr) {
+if (env->msr & M_MSR_PR) {


I'm not sure I'm keen on the M_ prefix, but I'll defer to Cedric or Daniel if they're ok 
with it.


In general there are inconsistencies with the use of MSR_PR (1 vs 1ull), which makes it 
tempting to replace MSR_PR the bit number with MSR_PR the mask and leave off the M_ 
prefix.  It's somewhat easy for MSR_PR, since missed conversions will certainly result in 
compiler warnings for out-of-range shift (the same would not be true with bits 0-6, LE 
through EP).
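
Roughly the difference, with illustrative definitions (PR is bit 14):

  /* bit number, as now: */
  #define MSR_PR  14
  if (env->msr & (1ull << MSR_PR)) { ... }

  /* mask: */
  #define MSR_PR  (1ull << 14)
  if (env->msr & MSR_PR) { ... }

where a missed conversion of the mask form, e.g. a leftover (1ull << MSR_PR),
shifts by 2^14 and triggers the out-of-range warning.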


Another possibility would be to use hw/registerfields.h.  Missed conversions are missing 
symbol errors.  You'd write FIELD_EX64(env->msr, MSR, PR) in cases like this and 
R_MSR_PR_MASK in cases like cpu_init.c.  It's more verbose for single bits like this, but 
much easier to work with multi-bit fields like MSR.TS.
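
A sketch of what that looks like (PR's bit position is real; the TS placement
here is only illustrative):

  #include "hw/registerfields.h"

  FIELD(MSR, PR, 14, 1)   /* single-bit field */
  FIELD(MSR, TS, 33, 2)   /* multi-bit field */

  if (FIELD_EX64(env->msr, MSR, PR)) { ... }
  env->msr = FIELD_DP64(env->msr, MSR, TS, 2);  /* deposit a value */

with R_MSR_PR_MASK available where only the constant is wanted.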



r~



Re: [PATCH 02/20] target/ppc: Remove unused msr_* macros

2022-04-26 Thread Richard Henderson

On 4/22/22 11:54, Víctor Colombo wrote:

Some msr_* macros are not used anywhere. Remove them as part of
the work to remove all hidden usage of *env.

Suggested-by: Richard Henderson
Signed-off-by: Víctor Colombo
---
  target/ppc/cpu.h | 21 -
  1 file changed, 21 deletions(-)


Reviewed-by: Richard Henderson 

r~



Re: [PATCH 01/20] target/ppc: Remove fpscr_* macros from cpu.h

2022-04-26 Thread Richard Henderson

On 4/22/22 11:54, Víctor Colombo wrote:

fpscr_* defined macros are hiding the usage of *env behind them.
Substitute the usage of these macros with `env->fpscr & FP_*` to make
the code cleaner.

Suggested-by: Richard Henderson
Signed-off-by: Víctor Colombo
---
  target/ppc/cpu.c        |  2 +-
  target/ppc/cpu.h        | 29 -
  target/ppc/fpu_helper.c | 28 ++--
  3 files changed, 15 insertions(+), 44 deletions(-)


Reviewed-by: Richard Henderson 

r~



Re: [RFC 15/18] vfio/iommufd: Implement iommufd backend

2022-04-26 Thread Alex Williamson
On Tue, 26 Apr 2022 16:27:03 -0300
Jason Gunthorpe  wrote:

> On Tue, Apr 26, 2022 at 12:45:41PM -0600, Alex Williamson wrote:
> > On Tue, 26 Apr 2022 11:11:56 -0300
> > Jason Gunthorpe  wrote:
> >   
> > > On Tue, Apr 26, 2022 at 10:08:30PM +0800, Yi Liu wrote:
> > >   
> > > > > I think it is strange that the allowed DMA a guest can do depends on
> > > > > the order how devices are plugged into the guest, and varys from
> > > > > device to device?
> > > > > 
> > > > > IMHO it would be nicer if qemu would be able to read the new reserved
> > > > > regions and unmap the conflicts before hot plugging the new device. We
> > > > > don't have a kernel API to do this, maybe we should have one?
> > > > 
> > > > For userspace drivers, it is fine to do it. For QEMU, it's not quite 
> > > > easy
> > > > since the IOVA is GPA which is determined per the e820 table.
> > > 
> > > Sure, that is why I said we may need a new API to get this data back
> > > so userspace can fix the address map before attempting to attach the
> > > new device. Currently that is not possible at all, the device attach
> > > fails and userspace has no way to learn what addresses are causing
> > > problems.  
> > 
> > We have APIs to get the IOVA ranges, both with legacy vfio and the
> > iommufd RFC, QEMU could compare these, but deciding to remove an
> > existing mapping is not something to be done lightly.   
> 
> Not quite, you can get the IOVA ranges after you attach the device,
> but device attach will fail if the new range restrictions intersect
> with the existing mappings. So we don't have an easy way to learn the
> new range restriction in a way that lets userspace ensure an attach
> will not fail due to reserved ranged overlapping with mappings.
> 
> The best you could do is make a dummy IOAS then attach the device,
> read the mappings, detatch, and then do your unmaps.

Right, the same thing the kernel does currently.

> I'm imagining something like IOMMUFD_DEVICE_GET_RANGES that can be
> called prior to attaching on the device ID.

Something like /sys/kernel/iommu_groups/$GROUP/reserved_regions?
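
For reference, current kernels print one region per line there as
"<start> <end> <type>"; for example (the exact regions are platform and
device specific):

  $ cat /sys/kernel/iommu_groups/7/reserved_regions
  0x00000000fee00000 0x00000000feefffff msi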

> > We must be absolutely certain that there is no DMA to that range
> > before doing so.  
> 
> Yes, but at the same time if the VM thinks it can DMA to that memory
> then it is quite likely to DMA to it with the new device that doesn't
> have it mapped in the first place.

Sorry, this assertion doesn't make sense to me.  We can't assume a
vIOMMU on x86, so QEMU typically maps the entire VM address space (ie.
device address space == system memory).  Some of those mappings are
likely DMA targets (RAM), but only a tiny fraction of the address space
may actually be used for DMA.  Some of those mappings are exceedingly
unlikely P2P DMA targets (device memory), so we don't consider mapping
failures to be fatal to attaching the device.

If we have a case where a range failed for one device but worked for a
previous, we're in the latter scenario, because we should have failed
the device attach otherwise.  Your assertion would require that there
are existing devices (plural) making use of this mapping and that the
new device is also likely to make use of this mapping.  I have a hard
time believing that evidence exists to support that statement.
 
> It is also a bit odd that the behavior depends on the order the
> devices are installed as if you plug the narrower device first then
> the next device will happily use the narrower ranges, but viceversa
> will get a different result.

P2P use cases are sufficiently rare that this hasn't been an issue.  I
think there's also still a sufficiently healthy dose of FUD about whether a
system supports P2P that drivers do some validation before relying on it.
 
> This is why I find it bit strange that qemu doesn't check the
> ranges. eg I would expect that anything declared as memory in the E820
> map has to be mappable to the iommu_domain or the device should not
> attach at all.

You have some interesting assumptions around associating
MemoryRegionSegments from the device AddressSpace to something like an
x86 specific E820 table.  The currently used rule of thumb is that if
we think it's memory, mapping failure is fatal to the device, otherwise
it's not.  If we want each device to have the most complete mapping
possible, then we'd use a container per device, but that implies a lot
of extra overhead.  Instead we try to attach the device to an existing
container within the address space and assume if it was good enough
there, it's good enough here.

> The P2P is a bit trickier, and I know we don't have a good story
> because we lack ACPI description, but I would have expected the same
> kind of thing. Anything P2Pable should be in the iommu_domain or the
> device should not attach. As with system memory there are only certain
> parts of the E820 map that an OS would use for P2P.
> 
> (ideally ACPI would indicate exactly what combinations of devices are
> P2Pable and then qemu would use that drive the mandator

Re: [PATCH 00/20] target/ppc: Remove hidden usages of *env

2022-04-26 Thread Richard Henderson

On 4/22/22 11:54, Víctor Colombo wrote:

By running the grep command `git grep -nr 'define \(fpscr\|msr\)_[a-z0-9]\+\>'`
we can find multiple macros that use `env->fpscr` and `env->msr` but don't
take *env as a parameter.

Richard Henderson said [1] that these macros hiding the usage of *env "are 
evil".
This patch series remove them and substitute with an explicit usage of *env by
adding macros in the same style of FP_* ones (e.g. FP_FI defined in cpu.h).

Patch 20 (target/ppc: Add unused M_MSR_* macros) implements unused macros, the
same that were removed in patch 02 (target/ppc: Remove unused msr_* macros). I
did that to keep the changes consistent with what was already present before.


Oh frabjous day! Callooh! Callay!


r~



Re: [PATCH v2 02/26] Use QEMU_SANITIZE_ADDRESS

2022-04-26 Thread Richard Henderson

On 4/26/22 02:26, marcandre.lur...@redhat.com wrote:

From: Marc-André Lureau

Signed-off-by: Marc-André Lureau
---
  tests/qtest/fdc-test.c    | 2 +-
  util/coroutine-ucontext.c | 2 +-
  2 files changed, 2 insertions(+), 2 deletions(-)


Reviewed-by: Richard Henderson 

r~



Re: [PATCH v2 01/26] Use QEMU_SANITIZE_THREAD

2022-04-26 Thread Richard Henderson

On 4/26/22 02:26, marcandre.lur...@redhat.com wrote:

From: Marc-André Lureau 

Signed-off-by: Marc-André Lureau 
Reviewed-by: Daniel P. Berrangé 
---
  include/qemu/atomic.h                        | 8 +++++---
  subprojects/libvhost-user/include/compiler.h | 1 +
  2 files changed, 6 insertions(+), 3 deletions(-)
  create mode 120000 subprojects/libvhost-user/include/compiler.h


Reviewed-by: Richard Henderson 


+++ b/subprojects/libvhost-user/include/compiler.h
@@ -0,0 +1 @@
+../../../include/qemu/compiler.h
\ No newline at end of file


Fix the missing newline.


r~



Re: [PATCH 6/9] docs: move replay docs to docs/system/replay.rst

2022-04-26 Thread Richard Henderson

On 4/22/22 04:53, Pavel Dovgalyuk wrote:

This patch adds replay description page, converting prior
text from docs/replay.txt.
The text was also updated and some sections were moved
to devel part of the docs.

Signed-off-by: Pavel Dovgalyuk
---
  docs/devel/replay.rst  |  264 ++-
  docs/replay.txt        |  407 ----
  docs/system/index.rst  |    1 +
  docs/system/replay.rst |  237 ++++
  4 files changed, 496 insertions(+), 413 deletions(-)
  delete mode 100644 docs/replay.txt
  create mode 100644 docs/system/replay.rst


Not thoroughly reviewed, but I browsed the outlines.
Acked-by: Richard Henderson 


r~



Re: [PATCH 5/9] docs: convert docs/devel/replay page to rst

2022-04-26 Thread Richard Henderson

On 4/22/22 04:53, Pavel Dovgalyuk wrote:

This patch converts prior .txt replay devel documentation to .rst.

Signed-off-by: Pavel Dovgalyuk 
---
  docs/devel/index-tcg.rst |    2 ++
  docs/devel/replay.rst    |   54 ++
  docs/devel/replay.txt    |   46 ---
  3 files changed, 56 insertions(+), 46 deletions(-)
  create mode 100644 docs/devel/replay.rst
  delete mode 100644 docs/devel/replay.txt

diff --git a/docs/devel/index-tcg.rst b/docs/devel/index-tcg.rst
index 0b0ad12c22..52af5444d6 100644
--- a/docs/devel/index-tcg.rst
+++ b/docs/devel/index-tcg.rst
@@ -8,8 +8,10 @@ are only implementing things for HW accelerated hypervisors.
  .. toctree::
 :maxdepth: 2
  
+

 tcg
 decodetree


Watch extra whitespace.  Otherwise,
Reviewed-by: Richard Henderson 


r~



Re: [PATCH 4/9] replay: simplify async event processing

2022-04-26 Thread Richard Henderson

On 4/22/22 04:53, Pavel Dovgalyuk wrote:

  static Event *replay_read_event(void)
  {
  Event *event;
+int event_kind = replay_state.data_kind - EVENT_ASYNC;


Use the enum type.
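
I.e. something like:

  ReplayAsyncEventKind event_kind = replay_state.data_kind - EVENT_ASYNC;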


+/* Asynchronous events IDs */
+
+enum ReplayAsyncEventKind {
+REPLAY_ASYNC_EVENT_BH,
+REPLAY_ASYNC_EVENT_BH_ONESHOT,
+REPLAY_ASYNC_EVENT_INPUT,
+REPLAY_ASYNC_EVENT_INPUT_SYNC,
+REPLAY_ASYNC_EVENT_CHAR_READ,
+REPLAY_ASYNC_EVENT_BLOCK,
+REPLAY_ASYNC_EVENT_NET,
+REPLAY_ASYNC_COUNT
+};

...

-enum ReplayAsyncEventKind {
-REPLAY_ASYNC_EVENT_BH,
-REPLAY_ASYNC_EVENT_BH_ONESHOT,
-REPLAY_ASYNC_EVENT_INPUT,
-REPLAY_ASYNC_EVENT_INPUT_SYNC,
-REPLAY_ASYNC_EVENT_CHAR_READ,
-REPLAY_ASYNC_EVENT_BLOCK,
-REPLAY_ASYNC_EVENT_NET,
-REPLAY_ASYNC_COUNT
-};
-
  typedef enum ReplayAsyncEventKind ReplayAsyncEventKind;


Merge or move the typedef with the enum definition, to keep them together.
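
E.g.:

  typedef enum ReplayAsyncEventKind {
      REPLAY_ASYNC_EVENT_BH,
      REPLAY_ASYNC_EVENT_BH_ONESHOT,
      REPLAY_ASYNC_EVENT_INPUT,
      REPLAY_ASYNC_EVENT_INPUT_SYNC,
      REPLAY_ASYNC_EVENT_CHAR_READ,
      REPLAY_ASYNC_EVENT_BLOCK,
      REPLAY_ASYNC_EVENT_NET,
      REPLAY_ASYNC_COUNT
  } ReplayAsyncEventKind;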


@@ -59,7 +59,6 @@ static const VMStateDescription vmstate_replay = {
  VMSTATE_UINT32(has_unread_data, ReplayState),
  VMSTATE_UINT64(file_offset, ReplayState),
  VMSTATE_UINT64(block_request_id, ReplayState),
-VMSTATE_INT32(read_event_kind, ReplayState),
  VMSTATE_UINT64(read_event_id, ReplayState),
  VMSTATE_END_OF_LIST()
  },


Second change here, are you bumping version at the end and I haven't seen it 
yet?


r~



Re: [PATCH 10/26] nbd: add missing coroutine_fn annotations

2022-04-26 Thread Eric Blake
On Fri, Apr 15, 2022 at 03:18:44PM +0200, Paolo Bonzini wrote:
> Signed-off-by: Paolo Bonzini 
> ---
>  block/nbd.c | 10 +++++-----
>  1 file changed, 5 insertions(+), 5 deletions(-)

As elsewhere in this series, a non-empty commit body would be useful.

Reviewed-by: Eric Blake 

> 
> diff --git a/block/nbd.c b/block/nbd.c
> index 5af4deac3f..a4c8d661ad 100644
> --- a/block/nbd.c
> +++ b/block/nbd.c
> @@ -974,11 +974,11 @@ static void nbd_iter_request_error(NBDReplyChunkIter *iter, int ret)
>   * nbd_reply_chunk_iter_receive
>   * The pointer stored in @payload requires g_free() to free it.
>   */
> -static bool nbd_reply_chunk_iter_receive(BDRVNBDState *s,
> - NBDReplyChunkIter *iter,
> - uint64_t handle,
> - QEMUIOVector *qiov, NBDReply *reply,
> - void **payload)
> +static bool coroutine_fn nbd_reply_chunk_iter_receive(BDRVNBDState *s,
> +  NBDReplyChunkIter *iter,
> +  uint64_t handle,
> +  QEMUIOVector *qiov, NBDReply *reply,

Perhaps worth rewrapping this line.
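
E.g. one possible rewrap:

  static bool coroutine_fn
  nbd_reply_chunk_iter_receive(BDRVNBDState *s, NBDReplyChunkIter *iter,
                               uint64_t handle, QEMUIOVector *qiov,
                               NBDReply *reply, void **payload)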

> +  void **payload)
>  {
>  int ret, request_ret;
>  NBDReply local_reply;
> -- 
> 2.35.1
> 
> 
> 

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org




[PULL 10/13] nbd: move s->state under requests_lock

2022-04-26 Thread Eric Blake
From: Paolo Bonzini 

Remove the confusing, and most likely wrong, atomics.  The only function
that used to be somewhat in a hot path was nbd_client_connected(),
but it is not anymore after the previous patches.

The same logic is used both to check if a request had to be reissued
and also in nbd_reconnecting_attempt().  The former cases are outside
requests_lock, while nbd_reconnecting_attempt() does have the lock,
therefore the two have been separated in the previous commit.
nbd_client_will_reconnect() can simply take s->requests_lock, while
nbd_reconnecting_attempt() can inline the access now that no
complicated atomics are involved.

Signed-off-by: Paolo Bonzini 
Message-Id: <20220414175756.671165-8-pbonz...@redhat.com>
Reviewed-by: Eric Blake 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
Reviewed-by: Lukas Straub 
Signed-off-by: Eric Blake 
---
 block/nbd.c | 76 -
 1 file changed, 40 insertions(+), 36 deletions(-)

diff --git a/block/nbd.c b/block/nbd.c
index 3cba874b1cf5..a5c690cef76b 100644
--- a/block/nbd.c
+++ b/block/nbd.c
@@ -35,7 +35,6 @@
 #include "qemu/option.h"
 #include "qemu/cutils.h"
 #include "qemu/main-loop.h"
-#include "qemu/atomic.h"

 #include "qapi/qapi-visit-sockets.h"
 #include "qapi/qmp/qstring.h"
@@ -72,10 +71,11 @@ typedef struct BDRVNBDState {
 NBDExportInfo info;

 /*
- * Protects free_sema, in_flight, requests[].coroutine,
+ * Protects state, free_sema, in_flight, requests[].coroutine,
  * reconnect_delay_timer.
  */
 QemuMutex requests_lock;
+NBDClientState state;
 CoQueue free_sema;
 int in_flight;
 NBDClientRequest requests[MAX_NBD_REQUESTS];
@@ -83,7 +83,6 @@ typedef struct BDRVNBDState {

 CoMutex send_mutex;
 CoMutex receive_mutex;
-NBDClientState state;

 QEMUTimer *open_timer;

@@ -132,11 +131,6 @@ static void nbd_clear_bdrvstate(BlockDriverState *bs)
 s->x_dirty_bitmap = NULL;
 }

-static bool nbd_client_connected(BDRVNBDState *s)
-{
-return qatomic_load_acquire(&s->state) == NBD_CLIENT_CONNECTED;
-}
-
 static bool coroutine_fn nbd_recv_coroutine_wake_one(NBDClientRequest *req)
 {
 if (req->receiving) {
@@ -159,14 +153,15 @@ static void coroutine_fn nbd_recv_coroutines_wake(BDRVNBDState *s, bool all)
 }
 }

-static void coroutine_fn nbd_channel_error(BDRVNBDState *s, int ret)
+/* Called with s->requests_lock held.  */
+static void coroutine_fn nbd_channel_error_locked(BDRVNBDState *s, int ret)
 {
-if (nbd_client_connected(s)) {
+if (s->state == NBD_CLIENT_CONNECTED) {
 qio_channel_shutdown(s->ioc, QIO_CHANNEL_SHUTDOWN_BOTH, NULL);
 }

 if (ret == -EIO) {
-if (nbd_client_connected(s)) {
+if (s->state == NBD_CLIENT_CONNECTED) {
 s->state = s->reconnect_delay ? NBD_CLIENT_CONNECTING_WAIT :
 NBD_CLIENT_CONNECTING_NOWAIT;
 }
@@ -177,6 +172,12 @@ static void coroutine_fn nbd_channel_error(BDRVNBDState *s, int ret)
 nbd_recv_coroutines_wake(s, true);
 }

+static void coroutine_fn nbd_channel_error(BDRVNBDState *s, int ret)
+{
+QEMU_LOCK_GUARD(&s->requests_lock);
+nbd_channel_error_locked(s, ret);
+}
+
 static void reconnect_delay_timer_del(BDRVNBDState *s)
 {
 if (s->reconnect_delay_timer) {
@@ -189,20 +190,18 @@ static void reconnect_delay_timer_cb(void *opaque)
 {
 BDRVNBDState *s = opaque;

-if (qatomic_load_acquire(&s->state) == NBD_CLIENT_CONNECTING_WAIT) {
+reconnect_delay_timer_del(s);
+WITH_QEMU_LOCK_GUARD(&s->requests_lock) {
+if (s->state != NBD_CLIENT_CONNECTING_WAIT) {
+return;
+}
 s->state = NBD_CLIENT_CONNECTING_NOWAIT;
-nbd_co_establish_connection_cancel(s->conn);
 }
-
-reconnect_delay_timer_del(s);
+nbd_co_establish_connection_cancel(s->conn);
 }

 static void reconnect_delay_timer_init(BDRVNBDState *s, uint64_t expire_time_ns)
 {
-if (qatomic_load_acquire(&s->state) != NBD_CLIENT_CONNECTING_WAIT) {
-return;
-}
-
 assert(!s->reconnect_delay_timer);
 s->reconnect_delay_timer = aio_timer_new(bdrv_get_aio_context(s->bs),
  QEMU_CLOCK_REALTIME,
@@ -225,7 +224,9 @@ static void nbd_teardown_connection(BlockDriverState *bs)
 s->ioc = NULL;
 }

-s->state = NBD_CLIENT_QUIT;
+WITH_QEMU_LOCK_GUARD(&s->requests_lock) {
+s->state = NBD_CLIENT_QUIT;
+}
 }

 static void open_timer_del(BDRVNBDState *s)
@@ -254,15 +255,15 @@ static void open_timer_init(BDRVNBDState *s, uint64_t expire_time_ns)
 timer_mod(s->open_timer, expire_time_ns);
 }

-static bool nbd_client_connecting_wait(BDRVNBDState *s)
-{
-return qatomic_load_acquire(&s->state) == NBD_CLIENT_CONNECTING_WAIT;
-}
-
 static bool nbd_client_will_reconnect(BDRVNBDState *s)
 {
-return qatomic_load_acquire(&s->state) == NBD_CLIENT_CONNECTING_WAIT;
+/*
+ * Called only after a socket error
