Re: [Qemu-devel] [PATCH 5/6] target-ppc: add lxvb16x and lxvh8x

2016-08-10 Thread Nikunj A Dadhania
Richard Henderson  writes:

> On 08/10/2016 02:51 PM, Nikunj A Dadhania wrote:
>> I can fix the BE case using following but not sure if that will be
>> correct !
>>
>> tcg_gen_qemu_ld_i64(xth, EA, ctx->mem_idx, MO_Q);
>> gen_helper_bswap32x2(xth, xth);
>> tcg_gen_addi_tl(EA, EA, 8);
>> tcg_gen_qemu_ld_i64(xtl, EA, ctx->mem_idx, MO_Q);
>> gen_helper_bswap32x2(xtl, xtl);
>
> This cannot be correct, because it assumes a host-dependent byte ordering.  
> You 
> should be able to see different results depending on a BE or LE host.

Forgot to mention, this was just for the BE part. Because MO_BEQ wasnt
giving result as expected.

if (ctx->le_mode) {
/* Same as before */
} else {
tcg_gen_qemu_ld_i64(xth, EA, ctx->mem_idx, MO_Q);
gen_helper_bswap32x2(xth, xth);
tcg_gen_addi_tl(EA, EA, 8);
tcg_gen_qemu_ld_i64(xtl, EA, ctx->mem_idx, MO_Q);
gen_helper_bswap32x2(xtl, xtl);
}

Regards
Nikunj




Re: [Qemu-devel] [PATCH 5/6] target-ppc: add lxvb16x and lxvh8x

2016-08-10 Thread Nikunj A Dadhania
Richard Henderson  writes:

> On 08/10/2016 02:51 PM, Nikunj A Dadhania wrote:
>> I can fix the BE case using following but not sure if that will be
>> correct !
>>
>> tcg_gen_qemu_ld_i64(xth, EA, ctx->mem_idx, MO_Q);
>> gen_helper_bswap32x2(xth, xth);
>> tcg_gen_addi_tl(EA, EA, 8);
>> tcg_gen_qemu_ld_i64(xtl, EA, ctx->mem_idx, MO_Q);
>> gen_helper_bswap32x2(xtl, xtl);
>
> This cannot be correct, because it assumes a host-dependent byte ordering.  
> You 
> should be able to see different results depending on a BE or LE host.
>
>
>> __vector uint32_t vrt32;
>> uint32_t rb32[4] = {0x00010203, 0x20212223, 0x30313233, 0x40414243};
>> asm("lxvw4x %x0, 0, %1 \n\t" \
>>  : "=ws"(vrt32) : "r"());
>> printf("VRT32 = "); vec_put_u32(vrt32);
>>
>> Result On LE:
>> VRT32 = 40414243 30313233 20212223 00010203
>>
>> Result On BE:
>> VRT32 = 03020100 23222120 33323130 43424140
>
> Did you really recompile the test program for BE and LE, or did you just use 
> a 
> different command-line switch to qemu with the same executable image?

I have two different images one LE and other BE and run it like this;

./ppc64le-linux-user/qemu-ppc64le  -cpu POWER9 test-tcg/LElxvx
./ppc64-linux-user/qemu-ppc64  -cpu POWER9 test-tcg/BElxvx

$ file LElxvx 
LElxvx: ELF 64-bit LSB executable, 64-bit PowerPC or cisco 7500, version 1 
(GNU/Linux), statically linked, for GNU/Linux 2.6.32, not stripped
$ file BElxvx 
BElxvx: ELF 64-bit MSB executable, 64-bit PowerPC or cisco 7500, version 1 
(GNU/Linux), statically linked, for GNU/Linux 4.4.3, not stripped

> I can't see how you could possibly get these results for BE.

I have been confused since yesterday, so thought of asking you.

Regards
Nikunj




Re: [Qemu-devel] [PATCH 5/6] target-ppc: add lxvb16x and lxvh8x

2016-08-10 Thread Richard Henderson

On 08/10/2016 02:51 PM, Nikunj A Dadhania wrote:

I can fix the BE case using following but not sure if that will be
correct !

tcg_gen_qemu_ld_i64(xth, EA, ctx->mem_idx, MO_Q);
gen_helper_bswap32x2(xth, xth);
tcg_gen_addi_tl(EA, EA, 8);
tcg_gen_qemu_ld_i64(xtl, EA, ctx->mem_idx, MO_Q);
gen_helper_bswap32x2(xtl, xtl);


This cannot be correct, because it assumes a host-dependent byte ordering.  You 
should be able to see different results depending on a BE or LE host.




__vector uint32_t vrt32;
uint32_t rb32[4] = {0x00010203, 0x20212223, 0x30313233, 0x40414243};
asm("lxvw4x %x0, 0, %1 \n\t" \
 : "=ws"(vrt32) : "r"());
printf("VRT32 = "); vec_put_u32(vrt32);

Result On LE:
VRT32 = 40414243 30313233 20212223 00010203

Result On BE:
VRT32 = 03020100 23222120 33323130 43424140


Did you really recompile the test program for BE and LE, or did you just use a 
different command-line switch to qemu with the same executable image?  I can't 
see how you could possibly get these results for BE.



r~



Re: [Qemu-devel] [PATCH 5/6] target-ppc: add lxvb16x and lxvh8x

2016-08-10 Thread Nikunj A Dadhania
Richard Henderson  writes:

> On 08/08/2016 10:57 AM, Richard Henderson wrote:
>> On 08/07/2016 11:06 PM, Nikunj A Dadhania wrote:
>>> +#define LXV(name, access, swap, type, elems) \
>>> +uint64_t helper_##name(CPUPPCState *env, \
>>> +   target_ulong addr)\
>>> +{\
>>> +type r[elems] = {0}; \
>>> +int i, index, bound, step;   \
>>> +if (msr_le) {\
>>> +index = elems - 1;   \
>>> +bound = -1;  \
>>> +step = -1;   \
>>> +} else  {\
>>> +index = 0;   \
>>> +bound = elems;   \
>>> +step = 1;\
>>> +}\
>>> + \
>>> +for (i = index; i != bound; i += step) { \
>>> +if (needs_byteswap(env)) {   \
>>> +r[i] = swap(access(env, addr, GETPC())); \
>>> +} else { \
>>> +r[i] =  access(env, addr, GETPC());  \
>>> +}\
>>> +addr = addr_add(env, addr, sizeof(type));\
>>> +}\
>>> +return  *((uint64_t *)r);\
>>> +}
>>
>> This looks more complicated than necessary.
>>
>> (1) In big-endian mode, surely this simplifies to two 64-bit big-endian 
>> loads.
>>
>> (2) In little-endian mode, the overhead of accessing memory surely dominates,
>> and therefore we should perform two 64-bit loads and manipulate the data 
>> after.
>>
>> AFAICS, this is easiest done by requesting two 64-bit *big-endian* loads, and
>> then swapping bytes.  E.g.
>>
>> uint64_t helper_bswap16x4(uint64_t x)
>> {
>> uint64_t m = 0x00ff00ff00ff00ffull;
>> return ((x & m) << 8) | ((x >> 8) & m);
>> }
>>
>> uint64_t helper_bswap32x2(uint64_t x)
>> {
>> return deposit64(bswap32(x >> 32), 32, 32, bswap32(x));
>> }
>
> To correct myself, this big-endian load really only makes sense for lxvh8x.
> For lxvw4x, a little-endian load with a word swap is fewer operations.  I.e.
>
>tcg_gen_qemu_ld_i64(t0, addr, ctx->mem_idx, MO_LEQ);
>tcg_gen_shri_i64(t1, t0, 32);
>tcg_gen_deposit_i64(dest, t1, t0, 32, 32);


I have following for lxvw4x:

if (ctx->le_mode) {
t0 = tcg_temp_new_i64();
t1 = tcg_temp_new_i64();
tcg_gen_qemu_ld_i64(t0, EA, ctx->mem_idx, MO_LEQ);
tcg_gen_shri_i64(t1, t0, 32);
tcg_gen_deposit_i64(xth, t1, t0, 32, 32);
tcg_gen_addi_tl(EA, EA, 8);
tcg_gen_qemu_ld_i64(t0, EA, ctx->mem_idx, MO_LEQ);
tcg_gen_shri_i64(t1, t0, 32);
tcg_gen_deposit_i64(xtl, t1, t0, 32, 32);
tcg_temp_free_i64(t0);
tcg_temp_free_i64(t1);
} else {
tcg_gen_qemu_ld_i64(xth, EA, ctx->mem_idx, MO_BEQ);
tcg_gen_addi_tl(EA, EA, 8);
tcg_gen_qemu_ld_i64(xtl, EA, ctx->mem_idx, MO_BEQ);
}

Here is the test code:

__vector uint32_t vrt32;
uint32_t rb32[4] = {0x00010203, 0x20212223, 0x30313233, 0x40414243};
asm("lxvw4x %x0, 0, %1 \n\t" \
 : "=ws"(vrt32) : "r"());
printf("VRT32 = "); vec_put_u32(vrt32);

Result On LE:
VRT32 = 40414243 30313233 20212223 00010203

Result On BE:
VRT32 = 03020100 23222120 33323130 43424140

I had have expected the the BE result to be
VRT32 = 00010203 20212223 30313233 40414243

But seems there is something going under the hoods. Am I missing
something here?

I can fix the BE case using following but not sure if that will be
correct !

tcg_gen_qemu_ld_i64(xth, EA, ctx->mem_idx, MO_Q);
gen_helper_bswap32x2(xth, xth);
tcg_gen_addi_tl(EA, EA, 8);
tcg_gen_qemu_ld_i64(xtl, EA, ctx->mem_idx, MO_Q);
gen_helper_bswap32x2(xtl, xtl);

uint64_t helper_bswap32x2(uint64_t x)
{
return deposit64((x >> 32), 32, 32, (x));
}

And then I get the expected result:
VRT32 = 00010203 20212223 30313233 40414243

Regards,
Nikunj




Re: [Qemu-devel] [PATCH 5/6] target-ppc: add lxvb16x and lxvh8x

2016-08-08 Thread Richard Henderson

On 08/08/2016 10:57 AM, Richard Henderson wrote:

On 08/07/2016 11:06 PM, Nikunj A Dadhania wrote:

+#define LXV(name, access, swap, type, elems) \
+uint64_t helper_##name(CPUPPCState *env, \
+   target_ulong addr)\
+{\
+type r[elems] = {0}; \
+int i, index, bound, step;   \
+if (msr_le) {\
+index = elems - 1;   \
+bound = -1;  \
+step = -1;   \
+} else  {\
+index = 0;   \
+bound = elems;   \
+step = 1;\
+}\
+ \
+for (i = index; i != bound; i += step) { \
+if (needs_byteswap(env)) {   \
+r[i] = swap(access(env, addr, GETPC())); \
+} else { \
+r[i] =  access(env, addr, GETPC());  \
+}\
+addr = addr_add(env, addr, sizeof(type));\
+}\
+return  *((uint64_t *)r);\
+}


This looks more complicated than necessary.

(1) In big-endian mode, surely this simplifies to two 64-bit big-endian loads.

(2) In little-endian mode, the overhead of accessing memory surely dominates,
and therefore we should perform two 64-bit loads and manipulate the data after.

AFAICS, this is easiest done by requesting two 64-bit *big-endian* loads, and
then swapping bytes.  E.g.

uint64_t helper_bswap16x4(uint64_t x)
{
uint64_t m = 0x00ff00ff00ff00ffull;
return ((x & m) << 8) | ((x >> 8) & m);
}

uint64_t helper_bswap32x2(uint64_t x)
{
return deposit64(bswap32(x >> 32), 32, 32, bswap32(x));
}


To correct myself, this big-endian load really only makes sense for lxvh8x.
For lxvw4x, a little-endian load with a word swap is fewer operations.  I.e.

  tcg_gen_qemu_ld_i64(t0, addr, ctx->mem_idx, MO_LEQ);
  tcg_gen_shri_i64(t1, t0, 32);
  tcg_gen_deposit_i64(dest, t1, t0, 32, 32);


r~



Re: [Qemu-devel] [PATCH 5/6] target-ppc: add lxvb16x and lxvh8x

2016-08-07 Thread Richard Henderson

On 08/07/2016 11:06 PM, Nikunj A Dadhania wrote:

+#define LXV(name, access, swap, type, elems) \
+uint64_t helper_##name(CPUPPCState *env, \
+   target_ulong addr)\
+{\
+type r[elems] = {0}; \
+int i, index, bound, step;   \
+if (msr_le) {\
+index = elems - 1;   \
+bound = -1;  \
+step = -1;   \
+} else  {\
+index = 0;   \
+bound = elems;   \
+step = 1;\
+}\
+ \
+for (i = index; i != bound; i += step) { \
+if (needs_byteswap(env)) {   \
+r[i] = swap(access(env, addr, GETPC())); \
+} else { \
+r[i] =  access(env, addr, GETPC());  \
+}\
+addr = addr_add(env, addr, sizeof(type));\
+}\
+return  *((uint64_t *)r);\
+}


This looks more complicated than necessary.

(1) In big-endian mode, surely this simplifies to two 64-bit big-endian loads.

(2) In little-endian mode, the overhead of accessing memory surely dominates, 
and therefore we should perform two 64-bit loads and manipulate the data after.


AFAICS, this is easiest done by requesting two 64-bit *big-endian* loads, and 
then swapping bytes.  E.g.


uint64_t helper_bswap16x4(uint64_t x)
{
uint64_t m = 0x00ff00ff00ff00ffull;
return ((x & m) << 8) | ((x >> 8) & m);
}

uint64_t helper_bswap32x2(uint64_t x)
{
return deposit64(bswap32(x >> 32), 32, 32, bswap32(x));
}


tcg_gen_qemu_ld_i64(dest, addr, MO_BEQ, s->mem_index);
if (ctx->le_mode) {
gen_helper_bswap16x4(dest, dest);
}


r~