Am 31.10.2016 um 19:30 schrieb Matt Turner:
On Mon, Oct 31, 2016 at 11:16 AM, Jochen Rollwagen
<[email protected]> wrote:
Am 31.10.2016 um 07:01 schrieb Matt Turner:
On Fri, Oct 28, 2016 at 1:28 AM, Jochen Rollwagen <[email protected]>
wrote:
Hi there,
gcc seems to create some sub-optimal code for the following code sequence
in
radeon_accel.c:
for (; nwords > 0; --nwords, ++d, ++s)
*d = ((*s & 0xffff) << 16) | ((*s >> 16) & 0xffff);
the body of the loop compiles to
lwz 9,40(31)
lwz 9,0(9)
rotlwi 10,9,16
lwz 9,36(31)
stw 10,0(9)
lwz 9,44(31)
addi 9,9,-1
stw 9,44(31)
lwz 9,36(31)
addi 9,9,4
stw 9,36(31)
lwz 9,40(31)
addi 9,9,4
stw 9,40(31)
this patch adds some (hopefully optimal) assembler code, bringing it in
line
with the other cases in the switch:
diff --git a/src/radeon_accel.c b/src/radeon_accel.c
index 1def2a3..580fa33 100644
--- a/src/radeon_accel.c
+++ b/src/radeon_accel.c
@@ -138,7 +138,16 @@ void RADEONCopySwap(uint8_t *dst, uint8_t *src,
unsigned int size, int swap)
unsigned int nwords = size >> 2;
for (; nwords > 0; --nwords, ++d, ++s)
- *d = ((*s & 0xffff) << 16) | ((*s >> 16) & 0xffff);
+#ifdef __powerpc__
+ __asm__ volatile ("rlwinm %0,%1,%2,%3,%4\n\t"
+ "rlwimi
%0,%1,%5,%6,%7\n\t"
+ : "=&r" (*d)
+ : "r" (*s),"i" (16),
"i"
(16),"i" (31) ,"i" (16), "i" (0),"i" (15)
+ :);
+
+#else
+ *d = ((*s & 0xffff) << 16) | ((*s >> 16) &
0xffff);
+#endif
It looks like this code just swaps the two 16-bit shorts. That looks
like a rotate by 16. I don't know anything about PPC assembly, but if
this description [1] of the rlwinm is correct, you should be able to
do it in a single rlwinm instruction, right? E.g., __rlwinm(*s, 16, 0,
31).
Also, if it's possible it would be more readable to embed the
immediates in the assembly text string instead of "i" inputs.
[1] https://gist.github.com/rygorous/1440600
Hi Matt,
thanks for your review.
You’re perfectly right, the one rlwinm does the trick –with half the
instructions (one).
Which very probably really makes this „the optimal“ code sequence.
Cool!
As for the integers, putting them directly in the asm statement seems to
make gcc’s inline asm interpret them as register names (r0, r16, r31).
There’s probably a funky compiler switch for that, but i guess the safest,
cleanest version is the one below.
You might try prefixing them with a # or $. Some assembly language's
syntax requires such a prefix for immediate values.
Also, if you want you might be interested in replacing the other
#ifdef __powerpc__ cases later in that file with uses of
__builtin_bswap*.
Hi again,
thanks for your suggestions.
I didn’t find any code examples or gcc documentation for your
suggestions („#“ or „$“ or whatever), so i guess it’s not possible to
replace the "i" constraints with gcc’s inline asm. Gcc’s inline asm
constraints are tricky enough, so i wouldn‘t want to experiment, and the
code works the way it is. Feel free to experiment and change them (if
you can find a powerpc machine to compile on, hehehe).
As for the gcc builtin’s:
1.they call (or are replaced by) the asm statements already in the code
anyway on powerpc platforms. So that would be mainly cosmetics
2. replacing the #ifdef powerpc’s + the rest of the code ( the #else
path) with __builtin_bswap’s *SHOULD* of course work on all platforms,
but i can only test it on powerpc and wouldn’t want to break anything else.
3. Besides, i’d have to build the driver again, which 3.1. I’m frankly
too lazy and 3.2. I don’t have the time for.
So as far as i’m concerned the last version of the patch i sent is the
final one (i’m attaching it). The builtin_bswap stuff could of course be
done as a separate patch by somebody able to build & test it on a more
exotic platform, like, say, some x86-variant :-). You can „deal me in“
as tester for powerpc, in that case.
On my machine I’m seeing quite noticeable performance improvements with
the patch. Maybe you could get it into mainline, I’d be interested
whether other powerpc holdouts see them, too, or it’s just me (it’s my
code, anyway, which might influence my perception – I’m sure it does !).
diff --git a/src/radeon_accel.c b/src/radeon_accel.c
index 1def2a3..2de2b96 100644
--- a/src/radeon_accel.c
+++ b/src/radeon_accel.c
@@ -137,8 +137,16 @@ void RADEONCopySwap(uint8_t *dst, uint8_t *src, unsigned int size, int swap)
unsigned int *s = (unsigned int *)src;
unsigned int nwords = size >> 2;
- for (; nwords > 0; --nwords, ++d, ++s)
- *d = ((*s & 0xffff) << 16) | ((*s >> 16) & 0xffff);
+ for (; nwords > 0; --nwords, ++d, ++s)
+#ifdef __powerpc__
+ __asm__ volatile ("rlwinm %0,%1,%2,%3,%4\n\t"
+ : "=&r" (*d)
+ : "r" (*s),"i" (16), "i" (0),"i" (31)
+ :);
+
+#else
+ *d = ((*s & 0xffff) << 16) | ((*s >> 16) & 0xffff);
+#endif
return;
}
case RADEON_HOST_DATA_SWAP_32BIT:
_______________________________________________
xorg-driver-ati mailing list
[email protected]
https://lists.x.org/mailman/listinfo/xorg-driver-ati