---------- Forwarded message ----------
From: <[email protected]>
Date: Fri, Oct 4, 2013 at 4:27 PM
Subject: [x265] [PATCH] replace block_copy_p_p vector class function with
intrinsic code
To: [email protected]


         {
             for (int x = 0; x < bx; x += 16)
             {
-                Vec16c word;
-                word.load_a(src + x);
-                word.store_a(dst + x);
+                __m128i word0 = _mm_load_si128((__m128i const*)(src + x)); // load block of 16 byte from src
+                _mm_store_si128((__m128i*)&dst[x], word0); // store block into dst
             }
Here too, I suggest adding an unrolled step for multiples of 8, using the
64-bit load function. Suppose the width is something like 24 or 25: the loop
above copies the first 16 elements, but the rest (for 25, that is
25 - 16 = 9) must be copied one element at a time. With an additional
8-wide step, only 25 - 16 - 8 = 1 element has to be copied individually.
Please add an unrolled loop for 8 and test it.
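
A minimal sketch of the suggestion above, under some assumptions: pixels are
8-bit, the function name block_copy_sketch and its signature are hypothetical
(this is not the x265 primitive), and unaligned loads/stores are used instead
of the aligned ones in the patch so that any pointer is valid. The 64-bit
step uses _mm_loadl_epi64 and _mm_storel_epi64 from SSE2.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper for illustration only: copies a bx-by-by block of
 * 8-bit pixels from src to dst, row by row. */
static void block_copy_sketch(int bx, int by,
                              uint8_t *dst, intptr_t dstride,
                              const uint8_t *src, intptr_t sstride)
{
    for (int y = 0; y < by; y++)
    {
        int x = 0;

        /* Copy 16 bytes at a time while a full 16-byte chunk remains. */
        for (; x + 16 <= bx; x += 16)
        {
            __m128i word = _mm_loadu_si128((const __m128i *)(src + x));
            _mm_storeu_si128((__m128i *)(dst + x), word);
        }

        /* Copy 8 bytes with a 64-bit load/store; runs at most once per row. */
        for (; x + 8 <= bx; x += 8)
        {
            __m128i half = _mm_loadl_epi64((const __m128i *)(src + x));
            _mm_storel_epi64((__m128i *)(dst + x), half);
        }

        /* Copy the remaining 0 to 7 bytes individually. */
        for (; x < bx; x++)
            dst[x] = src[x];

        src += sstride;
        dst += dstride;
    }
}
```

For a width of 25, each row takes one 16-byte copy, one 8-byte copy, and one
single-byte copy, which matches the 25 - 16 - 8 = 1 arithmetic above.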



             src += sstride;

Regards,
Praveen
_______________________________________________
x265-devel mailing list
[email protected]
https://mailman.videolan.org/listinfo/x265-devel
