Re: [Qemu-devel] [PATCH 5/6] target-ppc: add lxvb16x and lxvh8x
Richard Hendersonwrites: > On 08/10/2016 02:51 PM, Nikunj A Dadhania wrote: >> I can fix the BE case using following but not sure if that will be >> correct ! >> >> tcg_gen_qemu_ld_i64(xth, EA, ctx->mem_idx, MO_Q); >> gen_helper_bswap32x2(xth, xth); >> tcg_gen_addi_tl(EA, EA, 8); >> tcg_gen_qemu_ld_i64(xtl, EA, ctx->mem_idx, MO_Q); >> gen_helper_bswap32x2(xtl, xtl); > > This cannot be correct, because it assumes a host-dependent byte ordering. > You > should be able to see different results depending on a BE or LE host. Forgot to mention, this was just for the BE part. Because MO_BEQ wasnt giving result as expected. if (ctx->le_mode) { /* Same as before */ } else { tcg_gen_qemu_ld_i64(xth, EA, ctx->mem_idx, MO_Q); gen_helper_bswap32x2(xth, xth); tcg_gen_addi_tl(EA, EA, 8); tcg_gen_qemu_ld_i64(xtl, EA, ctx->mem_idx, MO_Q); gen_helper_bswap32x2(xtl, xtl); } Regards Nikunj
Re: [Qemu-devel] [PATCH 5/6] target-ppc: add lxvb16x and lxvh8x
Richard Hendersonwrites: > On 08/10/2016 02:51 PM, Nikunj A Dadhania wrote: >> I can fix the BE case using following but not sure if that will be >> correct ! >> >> tcg_gen_qemu_ld_i64(xth, EA, ctx->mem_idx, MO_Q); >> gen_helper_bswap32x2(xth, xth); >> tcg_gen_addi_tl(EA, EA, 8); >> tcg_gen_qemu_ld_i64(xtl, EA, ctx->mem_idx, MO_Q); >> gen_helper_bswap32x2(xtl, xtl); > > This cannot be correct, because it assumes a host-dependent byte ordering. > You > should be able to see different results depending on a BE or LE host. > > >> __vector uint32_t vrt32; >> uint32_t rb32[4] = {0x00010203, 0x20212223, 0x30313233, 0x40414243}; >> asm("lxvw4x %x0, 0, %1 \n\t" \ >> : "=ws"(vrt32) : "r"()); >> printf("VRT32 = "); vec_put_u32(vrt32); >> >> Result On LE: >> VRT32 = 40414243 30313233 20212223 00010203 >> >> Result On BE: >> VRT32 = 03020100 23222120 33323130 43424140 > > Did you really recompile the test program for BE and LE, or did you just use > a > different command-line switch to qemu with the same executable image? I have two different images one LE and other BE and run it like this; ./ppc64le-linux-user/qemu-ppc64le -cpu POWER9 test-tcg/LElxvx ./ppc64-linux-user/qemu-ppc64 -cpu POWER9 test-tcg/BElxvx $ file LElxvx LElxvx: ELF 64-bit LSB executable, 64-bit PowerPC or cisco 7500, version 1 (GNU/Linux), statically linked, for GNU/Linux 2.6.32, not stripped $ file BElxvx BElxvx: ELF 64-bit MSB executable, 64-bit PowerPC or cisco 7500, version 1 (GNU/Linux), statically linked, for GNU/Linux 4.4.3, not stripped > I can't see how you could possibly get these results for BE. I have been confused since yesterday, so thought of asking you. Regards Nikunj
Re: [Qemu-devel] [PATCH 5/6] target-ppc: add lxvb16x and lxvh8x
On 08/10/2016 02:51 PM, Nikunj A Dadhania wrote: I can fix the BE case using following but not sure if that will be correct ! tcg_gen_qemu_ld_i64(xth, EA, ctx->mem_idx, MO_Q); gen_helper_bswap32x2(xth, xth); tcg_gen_addi_tl(EA, EA, 8); tcg_gen_qemu_ld_i64(xtl, EA, ctx->mem_idx, MO_Q); gen_helper_bswap32x2(xtl, xtl); This cannot be correct, because it assumes a host-dependent byte ordering. You should be able to see different results depending on a BE or LE host. __vector uint32_t vrt32; uint32_t rb32[4] = {0x00010203, 0x20212223, 0x30313233, 0x40414243}; asm("lxvw4x %x0, 0, %1 \n\t" \ : "=ws"(vrt32) : "r"()); printf("VRT32 = "); vec_put_u32(vrt32); Result On LE: VRT32 = 40414243 30313233 20212223 00010203 Result On BE: VRT32 = 03020100 23222120 33323130 43424140 Did you really recompile the test program for BE and LE, or did you just use a different command-line switch to qemu with the same executable image? I can't see how you could possibly get these results for BE. r~
Re: [Qemu-devel] [PATCH 5/6] target-ppc: add lxvb16x and lxvh8x
Richard Hendersonwrites: > On 08/08/2016 10:57 AM, Richard Henderson wrote: >> On 08/07/2016 11:06 PM, Nikunj A Dadhania wrote: >>> +#define LXV(name, access, swap, type, elems) \ >>> +uint64_t helper_##name(CPUPPCState *env, \ >>> + target_ulong addr)\ >>> +{\ >>> +type r[elems] = {0}; \ >>> +int i, index, bound, step; \ >>> +if (msr_le) {\ >>> +index = elems - 1; \ >>> +bound = -1; \ >>> +step = -1; \ >>> +} else {\ >>> +index = 0; \ >>> +bound = elems; \ >>> +step = 1;\ >>> +}\ >>> + \ >>> +for (i = index; i != bound; i += step) { \ >>> +if (needs_byteswap(env)) { \ >>> +r[i] = swap(access(env, addr, GETPC())); \ >>> +} else { \ >>> +r[i] = access(env, addr, GETPC()); \ >>> +}\ >>> +addr = addr_add(env, addr, sizeof(type));\ >>> +}\ >>> +return *((uint64_t *)r);\ >>> +} >> >> This looks more complicated than necessary. >> >> (1) In big-endian mode, surely this simplifies to two 64-bit big-endian >> loads. >> >> (2) In little-endian mode, the overhead of accessing memory surely dominates, >> and therefore we should perform two 64-bit loads and manipulate the data >> after. >> >> AFAICS, this is easiest done by requesting two 64-bit *big-endian* loads, and >> then swapping bytes. E.g. >> >> uint64_t helper_bswap16x4(uint64_t x) >> { >> uint64_t m = 0x00ff00ff00ff00ffull; >> return ((x & m) << 8) | ((x >> 8) & m); >> } >> >> uint64_t helper_bswap32x2(uint64_t x) >> { >> return deposit64(bswap32(x >> 32), 32, 32, bswap32(x)); >> } > > To correct myself, this big-endian load really only makes sense for lxvh8x. > For lxvw4x, a little-endian load with a word swap is fewer operations. I.e. > >tcg_gen_qemu_ld_i64(t0, addr, ctx->mem_idx, MO_LEQ); >tcg_gen_shri_i64(t1, t0, 32); >tcg_gen_deposit_i64(dest, t1, t0, 32, 32); I have following for lxvw4x: if (ctx->le_mode) { t0 = tcg_temp_new_i64(); t1 = tcg_temp_new_i64(); tcg_gen_qemu_ld_i64(t0, EA, ctx->mem_idx, MO_LEQ); tcg_gen_shri_i64(t1, t0, 32); tcg_gen_deposit_i64(xth, t1, t0, 32, 32); tcg_gen_addi_tl(EA, EA, 8); tcg_gen_qemu_ld_i64(t0, EA, ctx->mem_idx, MO_LEQ); tcg_gen_shri_i64(t1, t0, 32); tcg_gen_deposit_i64(xtl, t1, t0, 32, 32); tcg_temp_free_i64(t0); tcg_temp_free_i64(t1); } else { tcg_gen_qemu_ld_i64(xth, EA, ctx->mem_idx, MO_BEQ); tcg_gen_addi_tl(EA, EA, 8); tcg_gen_qemu_ld_i64(xtl, EA, ctx->mem_idx, MO_BEQ); } Here is the test code: __vector uint32_t vrt32; uint32_t rb32[4] = {0x00010203, 0x20212223, 0x30313233, 0x40414243}; asm("lxvw4x %x0, 0, %1 \n\t" \ : "=ws"(vrt32) : "r"()); printf("VRT32 = "); vec_put_u32(vrt32); Result On LE: VRT32 = 40414243 30313233 20212223 00010203 Result On BE: VRT32 = 03020100 23222120 33323130 43424140 I had have expected the the BE result to be VRT32 = 00010203 20212223 30313233 40414243 But seems there is something going under the hoods. Am I missing something here? I can fix the BE case using following but not sure if that will be correct ! tcg_gen_qemu_ld_i64(xth, EA, ctx->mem_idx, MO_Q); gen_helper_bswap32x2(xth, xth); tcg_gen_addi_tl(EA, EA, 8); tcg_gen_qemu_ld_i64(xtl, EA, ctx->mem_idx, MO_Q); gen_helper_bswap32x2(xtl, xtl); uint64_t helper_bswap32x2(uint64_t x) { return deposit64((x >> 32), 32, 32, (x)); } And then I get the expected result: VRT32 = 00010203 20212223 30313233 40414243 Regards, Nikunj
Re: [Qemu-devel] [PATCH 5/6] target-ppc: add lxvb16x and lxvh8x
On 08/08/2016 10:57 AM, Richard Henderson wrote: On 08/07/2016 11:06 PM, Nikunj A Dadhania wrote: +#define LXV(name, access, swap, type, elems) \ +uint64_t helper_##name(CPUPPCState *env, \ + target_ulong addr)\ +{\ +type r[elems] = {0}; \ +int i, index, bound, step; \ +if (msr_le) {\ +index = elems - 1; \ +bound = -1; \ +step = -1; \ +} else {\ +index = 0; \ +bound = elems; \ +step = 1;\ +}\ + \ +for (i = index; i != bound; i += step) { \ +if (needs_byteswap(env)) { \ +r[i] = swap(access(env, addr, GETPC())); \ +} else { \ +r[i] = access(env, addr, GETPC()); \ +}\ +addr = addr_add(env, addr, sizeof(type));\ +}\ +return *((uint64_t *)r);\ +} This looks more complicated than necessary. (1) In big-endian mode, surely this simplifies to two 64-bit big-endian loads. (2) In little-endian mode, the overhead of accessing memory surely dominates, and therefore we should perform two 64-bit loads and manipulate the data after. AFAICS, this is easiest done by requesting two 64-bit *big-endian* loads, and then swapping bytes. E.g. uint64_t helper_bswap16x4(uint64_t x) { uint64_t m = 0x00ff00ff00ff00ffull; return ((x & m) << 8) | ((x >> 8) & m); } uint64_t helper_bswap32x2(uint64_t x) { return deposit64(bswap32(x >> 32), 32, 32, bswap32(x)); } To correct myself, this big-endian load really only makes sense for lxvh8x. For lxvw4x, a little-endian load with a word swap is fewer operations. I.e. tcg_gen_qemu_ld_i64(t0, addr, ctx->mem_idx, MO_LEQ); tcg_gen_shri_i64(t1, t0, 32); tcg_gen_deposit_i64(dest, t1, t0, 32, 32); r~
Re: [Qemu-devel] [PATCH 5/6] target-ppc: add lxvb16x and lxvh8x
On 08/07/2016 11:06 PM, Nikunj A Dadhania wrote: +#define LXV(name, access, swap, type, elems) \ +uint64_t helper_##name(CPUPPCState *env, \ + target_ulong addr)\ +{\ +type r[elems] = {0}; \ +int i, index, bound, step; \ +if (msr_le) {\ +index = elems - 1; \ +bound = -1; \ +step = -1; \ +} else {\ +index = 0; \ +bound = elems; \ +step = 1;\ +}\ + \ +for (i = index; i != bound; i += step) { \ +if (needs_byteswap(env)) { \ +r[i] = swap(access(env, addr, GETPC())); \ +} else { \ +r[i] = access(env, addr, GETPC()); \ +}\ +addr = addr_add(env, addr, sizeof(type));\ +}\ +return *((uint64_t *)r);\ +} This looks more complicated than necessary. (1) In big-endian mode, surely this simplifies to two 64-bit big-endian loads. (2) In little-endian mode, the overhead of accessing memory surely dominates, and therefore we should perform two 64-bit loads and manipulate the data after. AFAICS, this is easiest done by requesting two 64-bit *big-endian* loads, and then swapping bytes. E.g. uint64_t helper_bswap16x4(uint64_t x) { uint64_t m = 0x00ff00ff00ff00ffull; return ((x & m) << 8) | ((x >> 8) & m); } uint64_t helper_bswap32x2(uint64_t x) { return deposit64(bswap32(x >> 32), 32, 32, bswap32(x)); } tcg_gen_qemu_ld_i64(dest, addr, MO_BEQ, s->mem_index); if (ctx->le_mode) { gen_helper_bswap16x4(dest, dest); } r~