*Highlights*
*Details*
I designed and implemented an ARM NEON algorithm for DCT16x16. Because ARM registers are very limited, the algorithm processes a 16x4 strip at a time and loops 4 times to cover all of the DCT-1D rows. The DCT-2D pass is similar but works on 32-bit intermediates. The 32-bit multiplication is the bottleneck here: it takes 4 cycles, compared with a single cycle for a 16-bit multiplication.

*Plans*
Write an example for psyCost_pp<2> (psyCost_pp_4x4).

I need about 2 more weeks to finish DCT16x16. The function is large and complex, so I need more time to debug and adjust my algorithm and code. Each run of the debug top (modified from our Testbench) in the simulation environment takes about 20 minutes on average.

Thank you

Regards
Ramya
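
P.S. For reference, the 16x4 strip loop described under Details can be sketched roughly as below. This is only a scalar illustration in plain C, not the NEON code itself. The function names are hypothetical, and the per-strip kernel does only the first even/odd butterfly stage of a 16-point DCT as a stand-in for the full row transform. The output is widened to 32 bits here purely for the sketch.

```c
#include <stdint.h>

enum { N = 16, STRIP_ROWS = 4 };

/* Stand-in kernel (hypothetical name): first even/odd butterfly stage of a
 * 16-point DCT, applied to STRIP_ROWS rows of a 16-wide block.
 * Each output row holds E[0..7] followed by O[0..7]. */
static void butterfly_stage_16x4(const int16_t *src, int32_t *dst)
{
    for (int r = 0; r < STRIP_ROWS; r++) {
        const int16_t *s = src + r * N;
        int32_t *d = dst + r * N;
        for (int k = 0; k < N / 2; k++) {
            d[k]         = (int32_t)s[k] + s[N - 1 - k];  /* even part */
            d[k + N / 2] = (int32_t)s[k] - s[N - 1 - k];  /* odd part  */
        }
    }
}

/* Row pass over a 16x16 block: four iterations, one 16x4 strip each,
 * so only 4 rows need to be live at any time. */
static void rows_pass_16x16(const int16_t *src, int32_t *dst)
{
    for (int strip = 0; strip < N / STRIP_ROWS; strip++) {
        int off = strip * STRIP_ROWS * N;
        butterfly_stage_16x4(src + off, dst + off);
    }
}
```

In the real kernel each strip would be held in NEON registers and followed by the remaining butterfly and multiply stages. The point of the sketch is only the loop shape: 4 strips of 4 rows instead of all 16 rows at once.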
