---------- Forwarded message ----------
From: <[email protected]>
Date: Fri, Oct 4, 2013 at 4:27 PM
Subject: [x265] [PATCH] replace block_copy_p_p vector class function with
intrinsic code
To: [email protected]
{
for (int x = 0; x < bx; x += 16)
{
- Vec16c word;
- word.load_a(src + x);
- word.store_a(dst + x);
+ __m128i word0 = _mm_load_si128((__m128i const*)(src + x)); // load block of 16 byte from src
+ _mm_store_si128((__m128i*)&dst[x], word0); // store block into dst
}
Here too, I suggest adding an unrolled step for multiples of 8, using the
64-bit load function. Suppose the width is something like 24 or 25: the loop
above copies the first 16 elements, but the rest (for 25, that is
25 - 16 = 9) has to be copied one element at a time. If we add a step for 8,
only 25 - 16 - 8 = 1 element has to be copied individually. Please add an
unrolled loop for 8 and test it.
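A rough sketch of what I mean, not the actual x265 primitive: the function
name blockcopy_sketch and its signature are made up for illustration. It
uses unaligned 128-bit loads/stores (the patch uses the aligned ones) so
that it is safe for any pointer and stride, and _mm_loadl_epi64 /
_mm_storel_epi64 for the 64-bit step.

```cpp
#include <emmintrin.h> // SSE2 intrinsics
#include <cstdint>

// Hypothetical helper for illustration only: copies a bx-by-by block of bytes.
// Unaligned 128-bit accesses are an assumption made to keep the sketch safe
// for arbitrary pointers; the patch under review uses aligned accesses.
static void blockcopy_sketch(int bx, int by, uint8_t* dst, intptr_t dstride,
                             const uint8_t* src, intptr_t sstride)
{
    for (int y = 0; y < by; y++)
    {
        int x = 0;

        // Copy 16 bytes per iteration while a full 16-byte chunk remains.
        for (; x + 16 <= bx; x += 16)
        {
            __m128i word = _mm_loadu_si128((const __m128i*)(src + x));
            _mm_storeu_si128((__m128i*)(dst + x), word);
        }

        // At most one 8-byte chunk can remain after the loop above.
        if (x + 8 <= bx)
        {
            __m128i half = _mm_loadl_epi64((const __m128i*)(src + x)); // load low 64 bits
            _mm_storel_epi64((__m128i*)(dst + x), half);               // store low 64 bits
            x += 8;
        }

        // Copy whatever is left (fewer than 8 bytes) one element at a time.
        for (; x < bx; x++)
            dst[x] = src[x];

        src += sstride;
        dst += dstride;
    }
}
```

With a width of 25 this does one 16-byte copy, one 8-byte copy, and a single
scalar copy per row, instead of nine scalar copies.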
src += sstride;
Regards,
Praveen
_______________________________________________
x265-devel mailing list
[email protected]
https://mailman.videolan.org/listinfo/x265-devel