Here are some extra implementations that extend Christophe's work. The first one (SSE) is only for x86_32 targets, since x86_64 guarantees that SSE2 is available.
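To illustrate the x86_32-only point, the selection logic can be sketched roughly as below. This is a simplified stand-alone sketch, not the code in dcadsp_init.c: the flag bits, the enum and pick_synth_filter() are made-up names, and it assumes the existing SIMD version is the SSE2 one.

```c
/* Hypothetical CPU feature bits, for illustration only. */
enum {
    CPU_SSE  = 1 << 0,
    CPU_SSE2 = 1 << 1,
    CPU_AVX  = 1 << 2,
    CPU_FMA3 = 1 << 3
};

/* Hypothetical identifiers for the available synth_filter variants. */
enum synth_impl {
    SYNTH_C,
    SYNTH_SSE,
    SYNTH_SSE2,
    SYNTH_AVX,
    SYNTH_FMA3
};

/*
 * Pick a variant from the CPU flags. Later checks override earlier
 * ones, so the last matching instruction set wins.
 */
static enum synth_impl pick_synth_filter(int cpu_flags, int is_x86_64)
{
    enum synth_impl impl = SYNTH_C;

    /* The SSE version is only considered on 32-bit targets:
     * every x86_64 CPU has SSE2, so it would never be chosen there. */
    if (!is_x86_64 && (cpu_flags & CPU_SSE))
        impl = SYNTH_SSE;
    if (cpu_flags & CPU_SSE2)
        impl = SYNTH_SSE2;
    if (cpu_flags & CPU_AVX)
        impl = SYNTH_AVX;
    if (cpu_flags & CPU_FMA3)
        impl = SYNTH_FMA3;

    return impl;
}
```

With this sketch, a 32-bit CPU that only reports CPU_SSE gets SYNTH_SSE, while a 64-bit CPU reporting SSE, SSE2 and AVX gets SYNTH_AVX.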
The second patch is an AVX implementation using ymm registers. In my tests it was about 30 cycles faster than SSE2 on a Sandy Bridge CPU.
I don't have proper numbers for the third patch (FMA3), since I could only test on an AMD rig, where functions using ymm registers tend to have subpar performance. It still beat the AVX version by a decent margin, though, so Haswell should see a nice boost with it.

I could add an FMA4 version using xmm registers, which, unlike these AVX/FMA3 ymm ones, would benefit AMD users. Thoughts?

James Almer (3):
  x86/synth_filter: add synth_filter_fma3
  x86/synth_filter: add synth_filter_sse
  x86/synth_filter: add synth_filter_avx

 libavcodec/x86/dcadsp.asm    | 109 ++++++++++++++++++++++++++++--------------
 libavcodec/x86/dcadsp_init.c |  52 ++++++++++++++-------
 2 files changed, 107 insertions(+), 54 deletions(-)

-- 
1.8.3.2

_______________________________________________
libav-devel mailing list
libav-devel@libav.org
https://lists.libav.org/mailman/listinfo/libav-devel