Re: [libav-devel] [PATCH] utvideodec: Add a missing include
On Fri, 10 Feb 2017, Martin Storsjö wrote:

> This was missing from 77c23704c76, fixing building.
> ---
>  libavcodec/utvideodec.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/libavcodec/utvideodec.c b/libavcodec/utvideodec.c
> index 381b4f7..808e3be 100644
> --- a/libavcodec/utvideodec.c
> +++ b/libavcodec/utvideodec.c
> @@ -33,6 +33,7 @@
>  #include "bitstream.h"
>  #include "bswapdsp.h"
>  #include "bytestream.h"
> +#include "internal.h"
>  #include "thread.h"
>  #include "utvideo.h"
> --
> 2.10.1 (Apple Git-78)

Approved by wm4 on irc, pushed.

// Martin
___
libav-devel mailing list
libav-devel@libav.org
https://lists.libav.org/mailman/listinfo/libav-devel
[libav-devel] [PATCH] utvideodec: Add a missing include
This was missing from 77c23704c76, fixing building.
---
 libavcodec/utvideodec.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/libavcodec/utvideodec.c b/libavcodec/utvideodec.c
index 381b4f7..808e3be 100644
--- a/libavcodec/utvideodec.c
+++ b/libavcodec/utvideodec.c
@@ -33,6 +33,7 @@
 #include "bitstream.h"
 #include "bswapdsp.h"
 #include "bytestream.h"
+#include "internal.h"
 #include "thread.h"
 #include "utvideo.h"
--
2.10.1 (Apple Git-78)
Re: [libav-devel] [PATCH] travis: Ignore the filter-fade test
On Thu, Feb 9, 2017 at 8:30 PM, Luca Barbato wrote:
> On 26/01/2017 12:42, Luca Barbato wrote:
>> It glitches with the stale travis linux target.
>> ---
>>
>>  .travis.yml | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/.travis.yml b/.travis.yml
>> index 8e9629a..f7dab48 100644
>> --- a/.travis.yml
>> +++ b/.travis.yml
>> @@ -20,7 +20,7 @@ install:
>>    - if [ "$TRAVIS_OS_NAME" == "osx" ]; then brew install yasm; fi
>>  script:
>>    - mkdir -p libav-samples
>> -  - ./configure --samples=libav-samples --cc=$CC
>> +  - ./configure --samples=libav-samples --cc=$CC --ignore-tests=filter-fade
>>    - make -j 8
>>    - make fate-rsync
>>    - make check -j 8
>> --
>> 2.9.2
>
> Ping, I'd merge it tomorrow, not into figuring out what makes that
> combination of compiler and vm upset with that specific filter.

I'm ok with it but please change the commit log to something more descriptive.

--
Vittorio
Re: [libav-devel] [PATCH] travis: Ignore the filter-fade test
On 26/01/2017 12:42, Luca Barbato wrote:
> It glitches with the stale travis linux target.
> ---
>
>  .travis.yml | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/.travis.yml b/.travis.yml
> index 8e9629a..f7dab48 100644
> --- a/.travis.yml
> +++ b/.travis.yml
> @@ -20,7 +20,7 @@ install:
>    - if [ "$TRAVIS_OS_NAME" == "osx" ]; then brew install yasm; fi
>  script:
>    - mkdir -p libav-samples
> -  - ./configure --samples=libav-samples --cc=$CC
> +  - ./configure --samples=libav-samples --cc=$CC --ignore-tests=filter-fade
>    - make -j 8
>    - make fate-rsync
>    - make check -j 8
> --
> 2.9.2

Ping, I'd merge it tomorrow, not into figuring out what makes that
combination of compiler and vm upset with that specific filter.

lu
[libav-devel] [PATCH] hlsenc: Correctly write down all 16 bytes in hex
---
 libavformat/hlsenc.c | 11 ++-
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/libavformat/hlsenc.c b/libavformat/hlsenc.c
index 05c9adb..7aef02b 100644
--- a/libavformat/hlsenc.c
+++ b/libavformat/hlsenc.c
@@ -102,11 +102,12 @@ static void free_encryption(AVFormatContext *s)
     av_freep(&hls->key_basename);
 }

-static int dict_set_bin(AVDictionary **dict, const char *key, uint8_t *buf)
+static int dict_set_bin(AVDictionary **dict, const char *key,
+                        uint8_t *buf, size_t len)
 {
     char hex[33];

-    ff_data_to_hex(hex, buf, sizeof(buf), 0);
+    ff_data_to_hex(hex, buf, len, 0);
     hex[32] = '\0';

     return av_dict_set(dict, key, hex, 0);
@@ -136,7 +137,7 @@ static int setup_encryption(AVFormatContext *s)
             return AVERROR(EINVAL);
         }

-        if ((ret = dict_set_bin(&hls->enc_opts, "key", hls->key)) < 0)
+        if ((ret = dict_set_bin(&hls->enc_opts, "key", hls->key, hls->key_len)) < 0)
             return ret;
         k = hls->key;
     } else {
@@ -145,7 +146,7 @@ static int setup_encryption(AVFormatContext *s)
             return ret;
         }

-        if ((ret = dict_set_bin(&hls->enc_opts, "key", buf)) < 0)
+        if ((ret = dict_set_bin(&hls->enc_opts, "key", buf, sizeof(buf))) < 0)
             return ret;
         k = buf;
     }
@@ -158,7 +159,7 @@ static int setup_encryption(AVFormatContext *s)
             return AVERROR(EINVAL);
         }

-        if ((ret = dict_set_bin(&hls->enc_opts, "iv", hls->iv)) < 0)
+        if ((ret = dict_set_bin(&hls->enc_opts, "iv", hls->iv, hls->iv_len)) < 0)
             return ret;
     }

--
2.9.2
Re: [libav-devel] [PATCH] hlsenc: Correctly write down all 16 bytes in hex
On 09/02/2017 20:21, Anton Khirnov wrote:
> Looks very unsafe. Just pass the buffer size as a function parameter.

Ok, the size must be 16 though.
Re: [libav-devel] [PATCH] avcodec/nvenc: make gpu indices independent of supported capabilities
Thanks Luca. Please help submit the patch.

Ganapathy

-----Original Message-----
From: libav-devel [mailto:libav-devel-boun...@libav.org] On Behalf Of Luca Barbato
Sent: Wednesday, February 8, 2017 4:18 PM
To: libav-devel@libav.org
Subject: Re: [libav-devel] [PATCH] avcodec/nvenc: make gpu indices independent of supported capabilities

On 08/02/2017 23:52, Ganapathy Raman Kasi wrote:
> Hi,
>
> This patch fixes multiple unnecessary cuda contexts which are created
> incase the gpu device to use is greater than 0. Each cuda context
> creation takes about 100ms and this patch helps in reducing the
> initialization time incase we are using one of the secondary gpus in
> the system. The patch is being ported to libav. Please let me know if
> there is a better way to port patches. Thanks.

Looks good, thank you! and sending like this is ok, if you could use git
send-email it might be faster, but I guess depends on your mailing system.

lu
Re: [libav-devel] [PATCH] hlsenc: Correctly write down all 16 bytes in hex
Quoting Luca Barbato (2017-02-08 13:42:30)
> ---
>
>  libavformat/hlsenc.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/libavformat/hlsenc.c b/libavformat/hlsenc.c
> index 05c9adb..3496bdd 100644
> --- a/libavformat/hlsenc.c
> +++ b/libavformat/hlsenc.c
> @@ -106,7 +106,7 @@ static int dict_set_bin(AVDictionary **dict, const char *key, uint8_t *buf)
>  {
>      char hex[33];
>
> -    ff_data_to_hex(hex, buf, sizeof(buf), 0);
> +    ff_data_to_hex(hex, buf, 16, 0);
>      hex[32] = '\0';
>
>      return av_dict_set(dict, key, hex, 0);
> --
> 2.9.2

Looks very unsafe. Just pass the buffer size as a function parameter.

--
Anton Khirnov
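For context on why the original `sizeof(buf)` was wrong: inside `dict_set_bin()`, `buf` is a `uint8_t *`, so `sizeof(buf)` yields the size of a pointer (typically 4 or 8), not the 16 bytes of the key. The following is a minimal sketch of that pitfall and of the "pass the size" approach suggested above. All helper names (`to_hex`, `size_via_pointer`, `key_to_hex`) are hypothetical stand-ins for illustration, not libav functions.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a hex formatter: writes 2 * len lowercase hex
 * digits plus a terminating NUL into dst (dst must hold 2 * len + 1 bytes). */
static void to_hex(char *dst, const uint8_t *src, size_t len)
{
    static const char digits[] = "0123456789abcdef";
    size_t i;

    for (i = 0; i < len; i++) {
        dst[2 * i]     = digits[src[i] >> 4];
        dst[2 * i + 1] = digits[src[i] & 0x0f];
    }
    dst[2 * len] = '\0';
}

/* What sizeof sees through a pointer parameter: the size of the pointer
 * itself, regardless of how large the caller's array is. */
static size_t size_via_pointer(const uint8_t *buf)
{
    return sizeof(buf);
}

/* Explicit length, checked against the destination size before writing.
 * Returns 0 on success, -1 if the hex string would not fit. */
static int key_to_hex(char *dst, size_t dst_size,
                      const uint8_t *buf, size_t len)
{
    if (dst_size < 2 * len + 1)
        return -1;
    to_hex(dst, buf, len);
    return 0;
}
```

A caller that owns a real array can pass `sizeof(key)`, since there the operand is the array and not a pointer; a caller holding only a pointer has to carry the length separately.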
Re: [libav-devel] [PATCH 1/6] arm: vp9itxfm: Share instructions for loading idct coeffs in the 8x8 function
On 2017-02-09 14:29:56 +0200, Martin Storsjö wrote:
> ---
>  libavcodec/arm/vp9itxfm_neon.S | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
>
> diff --git a/libavcodec/arm/vp9itxfm_neon.S b/libavcodec/arm/vp9itxfm_neon.S
> index 167d517..3d0b0fa 100644
> --- a/libavcodec/arm/vp9itxfm_neon.S
> +++ b/libavcodec/arm/vp9itxfm_neon.S
> @@ -412,13 +412,12 @@ function ff_vp9_\txfm1\()_\txfm2\()_8x8_add_neon, export=1
>  .ifc \txfm1\()_\txfm2,idct_idct
>          movrel          r12, idct_coeffs
>          vpush           {q4-q5}
> -        vld1.16         {q0}, [r12,:128]
>  .else
>          movrel          r12, iadst8_coeffs
>          vld1.16         {q1}, [r12,:128]!
>          vpush           {q4-q7}
> -        vld1.16         {q0}, [r12,:128]
>  .endif
> +        vld1.16         {q0}, [r12,:128]
>
>          vmov.i16        q2, #0
>          vmov.i16        q3, #0

the whole set is ok

Janne
Re: [libav-devel] [PATCH] dv: Convert to the new bitreader
On 09/02/2017 17:42, Diego Biurrun wrote:
> I guess the size variables should have size_t type ;-p

As you like.
Re: [libav-devel] [PATCH] dv: Convert to the new bitreader
On Thu, Feb 09, 2017 at 05:41:21PM +0100, Diego Biurrun wrote:
> --- a/libavcodec/bitstream.h
> +++ b/libavcodec/bitstream.h
> @@ -384,4 +384,32 @@ static inline int bitstream_apply_sign(BitstreamContext *bc, int val)
>
> +static inline void bitstream_unget(BitstreamContext *bc, uint64_t value, int size)
> +{
> +    int cache_size = sizeof(bc->bits) * 8;

I guess the size variables should have size_t type ;-p

Diego
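For readers unfamiliar with the cache layout under discussion, here is a minimal, self-contained sketch of what an "unget" on a left-aligned 64-bit bit cache does. The `BitCache` type and the `cache_read()` / `cache_unget()` names are hypothetical and only model the idea. The real `BitstreamContext` also tracks a buffer pointer and can unwind the cache when it is too full, which this sketch replaces with a plain error return.

```c
#include <stdint.h>

/* Hypothetical model of a bit cache: unread bits are left-aligned,
 * so the next bit to be read is the most significant bit of `bits`. */
typedef struct BitCache {
    uint64_t bits;
    unsigned bits_left; /* number of valid bits currently in the cache */
} BitCache;

/* Read n bits from the top of the cache.
 * Precondition: 1 <= n <= 32 and n <= c->bits_left. */
static uint32_t cache_read(BitCache *c, unsigned n)
{
    uint32_t v = (uint32_t)(c->bits >> (64 - n));

    c->bits     <<= n;
    c->bits_left -= n;
    return v;
}

/* Push the low n bits of value back in front of the unread bits.
 * Returns 0 on success, -1 if n is out of range or the cache would
 * overflow (the patch under review unwinds the cache in that case). */
static int cache_unget(BitCache *c, uint32_t value, unsigned n)
{
    if (n == 0 || n > 32 || c->bits_left + n > 64)
        return -1;

    c->bits       = (c->bits >> n) | ((uint64_t)value << (64 - n));
    c->bits_left += n;
    return 0;
}
```

Rejecting `n == 0` in the sketch avoids a shift by 64, which is undefined in C; whether the sizes are `int`, `unsigned` or `size_t` does not change the arithmetic as long as they stay in range.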
[libav-devel] [PATCH] dv: Convert to the new bitreader
From: Luca Barbato

---
Moved the bitstream_unwind() and bitstream_unget() functions to bitstream.h
as requested by Anton.

 libavcodec/bitstream.h | 28 +++
 libavcodec/dvdec.c     | 94 ++
 2 files changed, 69 insertions(+), 53 deletions(-)

diff --git a/libavcodec/bitstream.h b/libavcodec/bitstream.h
index 996e32e..f75a35c 100644
--- a/libavcodec/bitstream.h
+++ b/libavcodec/bitstream.h
@@ -384,4 +384,32 @@ static inline int bitstream_apply_sign(BitstreamContext *bc, int val)
     return (val ^ sign) - sign;
 }

+/* Unwind the cache so a refill_32 can fill it again. */
+static inline void bitstream_unwind(BitstreamContext *bc)
+{
+    int unwind = 4;
+    int unwind_bits = unwind * 8;
+
+    if (bc->bits_left < unwind_bits)
+        return;
+
+    bc->bits      >>= unwind_bits;
+    bc->bits      <<= unwind_bits;
+    bc->bits_left  -= unwind_bits;
+    bc->ptr        -= unwind;
+}
+
+/* Unget up to 32 bits. */
+static inline void bitstream_unget(BitstreamContext *bc, uint64_t value, int size)
+{
+    int cache_size = sizeof(bc->bits) * 8;
+
+    if (bc->bits_left + size > cache_size)
+        bitstream_unwind(bc);
+
+    bc->bits = (bc->bits >> size) | (value << (cache_size - size));
+
+    bc->bits_left += size;
+}
+
 #endif /* AVCODEC_BITSTREAM_H */
diff --git a/libavcodec/dvdec.c b/libavcodec/dvdec.c
index dc37a5e..a2f0171 100644
--- a/libavcodec/dvdec.c
+++ b/libavcodec/dvdec.c
@@ -40,9 +40,9 @@
 #include "libavutil/pixdesc.h"

 #include "avcodec.h"
+#include "bitstream.h"
 #include "dv.h"
 #include "dvdata.h"
-#include "get_bits.h"
 #include "idctdsp.h"
 #include "internal.h"
 #include "put_bits.h"
@@ -80,51 +80,34 @@ static av_cold int dvvideo_decode_init(AVCodecContext *avctx)
 }

 /* decode AC coefficients */
-static void dv_decode_ac(GetBitContext *gb, BlockInfo *mb, int16_t *block)
+static void dv_decode_ac(BitstreamContext *bc, BlockInfo *mb, int16_t *block)
 {
-    int last_index = gb->size_in_bits;
     const uint8_t *scan_table = mb->scan_table;
     const uint32_t *factor_table = mb->factor_table;
     int pos = mb->pos;
     int partial_bit_count = mb->partial_bit_count;
-    int level, run, vlc_len, index;
-
-    OPEN_READER_NOSIZE(re, gb);
-    UPDATE_CACHE(re, gb);
+    int level, run;

     /* if we must parse a partial VLC, we do it here */
     if (partial_bit_count > 0) {
-        re_cache = re_cache >> partial_bit_count |
-                   mb->partial_bit_buffer;
-        re_index -= partial_bit_count;
+        bitstream_unget(bc, mb->partial_bit_buffer, partial_bit_count);
         mb->partial_bit_count = 0;
     }

     /* get the AC coefficients until last_index is reached */
     for (;;) {
-        ff_dlog(NULL, "%2d: bits=%04x index=%u\n", pos, SHOW_UBITS(re, gb, 16),
-                re_index);
-        /* our own optimized GET_RL_VLC */
-        index   = NEG_USR32(re_cache, TEX_VLC_BITS);
-        vlc_len = ff_dv_rl_vlc[index].len;
-        if (vlc_len < 0) {
-            index = NEG_USR32((unsigned) re_cache << TEX_VLC_BITS, -vlc_len) +
-                    ff_dv_rl_vlc[index].level;
-            vlc_len = TEX_VLC_BITS - vlc_len;
-        }
-        level = ff_dv_rl_vlc[index].level;
-        run   = ff_dv_rl_vlc[index].run;
-
-        /* gotta check if we're still within gb boundaries */
-        if (re_index + vlc_len > last_index) {
-            /* should be < 16 bits otherwise a codeword could have been parsed */
-            mb->partial_bit_count  = last_index - re_index;
-            mb->partial_bit_buffer = re_cache & ~(-1u >> mb->partial_bit_count);
-            re_index               = last_index;
+        BitstreamContext tmp = *bc;
+
+        ff_dlog(NULL, "%2d: bits=%04x index=%d\n",
+                pos, bitstream_peek(bc, 16), bitstream_tell(bc));
+
+        BITSTREAM_RL_VLC(level, run, bc, ff_dv_rl_vlc, TEX_VLC_BITS, 2);
+
+        if (bitstream_bits_left(bc) < 0) {
+            mb->partial_bit_count  = bitstream_bits_left(&tmp);
+            mb->partial_bit_buffer = bitstream_peek(&tmp, mb->partial_bit_count);
             break;
         }
-        re_index += vlc_len;
-
         ff_dlog(NULL, "run=%d level=%d\n", run, level);
         pos += run;
         if (pos >= 64)
@@ -133,22 +116,22 @@ static void dv_decode_ac(GetBitContext *gb, BlockInfo *mb, int16_t *block)
         level = (level * factor_table[pos] + (1 << (dv_iweight_bits - 1))) >>
                 dv_iweight_bits;
         block[scan_table[pos]] = level;
-
-        UPDATE_CACHE(re, gb);
     }
-    CLOSE_READER(re, gb);
     mb->pos = pos;
 }

-static inline void bit_copy(PutBitContext *pb, GetBitContext *gb)
+static inline void bit_copy(PutBitContext *pb, BitstreamContext *bc)
 {
-    int bits_left = get_bits_left(gb);
-    while (bits_left >= MIN_CACHE_BITS) {
-        put_bits(pb, MIN_CACHE_BITS, get_bits(gb, MIN_CACHE_BITS));
-
Re: [libav-devel] [PATCH 6/6] aarch64: vp9itxfm: Fix incorrect vertical alignment
On Thu, Feb 09, 2017 at 02:30:01PM +0200, Martin Storsjö wrote:
> ---
>  libavcodec/aarch64/vp9itxfm_neon.S | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)

OK

Diego
[libav-devel] [PATCH 4/4] aarch64: vp9itxfm: Reorder iadst16 coeffs
This matches the order they are in the 16 bpp version. There they are in this order, to make sure we access them in the same order they are declared, easing loading only half of the coefficients at a time. This makes the 8 bpp version match the 16 bpp version better. --- libavcodec/aarch64/vp9itxfm_neon.S | 12 ++-- 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/libavcodec/aarch64/vp9itxfm_neon.S b/libavcodec/aarch64/vp9itxfm_neon.S index f87f6bd..7b7dbd4 100644 --- a/libavcodec/aarch64/vp9itxfm_neon.S +++ b/libavcodec/aarch64/vp9itxfm_neon.S @@ -37,8 +37,8 @@ idct_coeffs: endconst const iadst16_coeffs, align=4 -.short 16364, 804, 15893, 3981, 14811, 7005, 13160, 9760 -.short 11003, 12140, 8423, 14053, 5520, 15426, 2404, 16207 +.short 16364, 804, 15893, 3981, 11003, 12140, 8423, 14053 +.short 14811, 7005, 13160, 9760, 5520, 15426, 2404, 16207 endconst // out1 = ((in1 + in2) * d0[0] + (1 << 13)) >> 14 @@ -622,19 +622,19 @@ function iadst16 ld1 {v0.8h,v1.8h}, [x11] dmbutterfly_l v6, v7, v4, v5, v31, v16, v0.h[1], v0.h[0] // v6,v7 = t1, v4,v5 = t0 -dmbutterfly_l v10, v11, v8, v9, v23, v24, v1.h[1], v1.h[0] // v10,v11 = t9, v8,v9 = t8 +dmbutterfly_l v10, v11, v8, v9, v23, v24, v0.h[5], v0.h[4] // v10,v11 = t9, v8,v9 = t8 dbutterfly_nv31, v24, v6, v7, v10, v11, v12, v13, v10, v11 // v31 = t1a, v24 = t9a dmbutterfly_l v14, v15, v12, v13, v29, v18, v0.h[3], v0.h[2] // v14,v15 = t3, v12,v13 = t2 dbutterfly_nv16, v23, v4, v5, v8, v9, v6, v7, v8, v9 // v16 = t0a, v23 = t8a -dmbutterfly_l v6, v7, v4, v5, v21, v26, v1.h[3], v1.h[2] // v6,v7 = t11, v4,v5 = t10 +dmbutterfly_l v6, v7, v4, v5, v21, v26, v0.h[7], v0.h[6] // v6,v7 = t11, v4,v5 = t10 dbutterfly_nv29, v26, v14, v15, v6, v7, v8, v9, v6, v7 // v29 = t3a, v26 = t11a -dmbutterfly_l v10, v11, v8, v9, v27, v20, v0.h[5], v0.h[4] // v10,v11 = t5, v8,v9 = t4 +dmbutterfly_l v10, v11, v8, v9, v27, v20, v1.h[1], v1.h[0] // v10,v11 = t5, v8,v9 = t4 dbutterfly_nv18, v21, v12, v13, v4, v5, v6, v7, v4, v5 // v18 = t2a, 
v21 = t10a dmbutterfly_l v14, v15, v12, v13, v19, v28, v1.h[5], v1.h[4] // v14,v15 = t13, v12,v13 = t12 dbutterfly_nv20, v28, v10, v11, v14, v15, v4, v5, v14, v15 // v20 = t5a, v28 = t13a -dmbutterfly_l v6, v7, v4, v5, v25, v22, v0.h[7], v0.h[6] // v6,v7 = t7, v4,v5 = t6 +dmbutterfly_l v6, v7, v4, v5, v25, v22, v1.h[3], v1.h[2] // v6,v7 = t7, v4,v5 = t6 dbutterfly_nv27, v19, v8, v9, v12, v13, v10, v11, v12, v13 // v27 = t4a, v19 = t12a dmbutterfly_l v10, v11, v8, v9, v17, v30, v1.h[7], v1.h[6] // v10,v11 = t15, v8,v9 = t14 -- 2.7.4 ___ libav-devel mailing list libav-devel@libav.org https://lists.libav.org/mailman/listinfo/libav-devel
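The reordering in this patch can be checked with a small C model. This is only an illustrative sketch: the two tables are copied from the `iadst16_coeffs` hunk above, while `reorder_index()` and `reorder_matches()` are hypothetical helper names that do not exist in libav.

```c
#include <stdint.h>

/* iadst16_coeffs before and after the patch, as flat 16-element tables.
 * Elements 0..7 are loaded into the first register, 8..15 into the second. */
static const int16_t iadst16_old[16] = {
    16364,   804, 15893,  3981, 14811,  7005, 13160,  9760,
    11003, 12140,  8423, 14053,  5520, 15426,  2404, 16207,
};

static const int16_t iadst16_new[16] = {
    16364,   804, 15893,  3981, 11003, 12140,  8423, 14053,
    14811,  7005, 13160,  9760,  5520, 15426,  2404, 16207,
};

/* Map an index in the old layout to its index in the new layout:
 * elements 4..7 and 8..11 trade places, everything else stays put. */
static int reorder_index(int old_idx)
{
    if (old_idx >= 4 && old_idx < 8)
        return old_idx + 4;
    if (old_idx >= 8 && old_idx < 12)
        return old_idx - 4;
    return old_idx;
}

/* Returns 1 if applying reorder_index() to the old table gives the new one. */
static int reorder_matches(void)
{
    for (int i = 0; i < 16; i++)
        if (iadst16_new[reorder_index(i)] != iadst16_old[i])
            return 0;
    return 1;
}
```

In register terms, old `v1.h[0..3]` becomes new `v0.h[4..7]` and vice versa, which matches the lane changes in the `dmbutterfly_l` lines: after the patch the first four `dmbutterfly_l` invocations read only `v0` and the last four only `v1`.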
[libav-devel] [PATCH 2/4] aarch64: vp9itxfm: Reorder the idct coefficients for better pairing
All elements are used pairwise, except for the first one. Previously, the 16th element was unused. Move the unused element to the second slot, to make the later element pairs not split across registers. This simplifies loading only parts of the coefficients, reducing the difference to the 16 bpp version. --- libavcodec/aarch64/vp9itxfm_neon.S | 124 ++--- 1 file changed, 62 insertions(+), 62 deletions(-) diff --git a/libavcodec/aarch64/vp9itxfm_neon.S b/libavcodec/aarch64/vp9itxfm_neon.S index c954d1a..f87f6bd 100644 --- a/libavcodec/aarch64/vp9itxfm_neon.S +++ b/libavcodec/aarch64/vp9itxfm_neon.S @@ -22,7 +22,7 @@ #include "neon.S" const itxfm4_coeffs, align=4 -.short 11585, 6270, 15137, 0 +.short 11585, 0, 6270, 15137 iadst4_coeffs: .short 5283, 15212, 9929, 13377 endconst @@ -30,8 +30,8 @@ endconst const iadst8_coeffs, align=4 .short 16305, 1606, 14449, 7723, 10394, 12665, 4756, 15679 idct_coeffs: -.short 11585, 6270, 15137, 3196, 16069, 13623, 9102, 1606 -.short 16305, 12665, 10394, 7723, 14449, 15679, 4756, 0 +.short 11585, 0, 6270, 15137, 3196, 16069, 13623, 9102 +.short 1606, 16305, 12665, 10394, 7723, 14449, 15679, 4756 .short 804, 16364, 12140, 11003, 7005, 14811, 15426, 5520 .short 3981, 15893, 14053, 8423, 9760, 13160, 16207, 2404 endconst @@ -192,14 +192,14 @@ endconst .endm .macro idct4 c0, c1, c2, c3 -smull v22.4s,\c1\().4h, v0.h[2] -smull v20.4s,\c1\().4h, v0.h[1] +smull v22.4s,\c1\().4h, v0.h[3] +smull v20.4s,\c1\().4h, v0.h[2] add v16.4h,\c0\().4h, \c2\().4h sub v17.4h,\c0\().4h, \c2\().4h -smlal v22.4s,\c3\().4h, v0.h[1] +smlal v22.4s,\c3\().4h, v0.h[2] smull v18.4s,v16.4h,v0.h[0] smull v19.4s,v17.4h,v0.h[0] -smlsl v20.4s,\c3\().4h, v0.h[2] +smlsl v20.4s,\c3\().4h, v0.h[3] rshrn v22.4h,v22.4s,#14 rshrn v18.4h,v18.4s,#14 rshrn v19.4h,v19.4s,#14 @@ -326,9 +326,9 @@ itxfm_func4x4 iwht, iwht .macro idct8 dmbutterfly0v16, v20, v16, v20, v2, v3, v4, v5, v6, v7 // v16 = t0a, v20 = t1a -dmbutterfly v18, v22, v0.h[1], v0.h[2], v2, v3, v4, v5 // v18 = t2a, 
v22 = t3a -dmbutterfly v17, v23, v0.h[3], v0.h[4], v2, v3, v4, v5 // v17 = t4a, v23 = t7a -dmbutterfly v21, v19, v0.h[5], v0.h[6], v2, v3, v4, v5 // v21 = t5a, v19 = t6a +dmbutterfly v18, v22, v0.h[2], v0.h[3], v2, v3, v4, v5 // v18 = t2a, v22 = t3a +dmbutterfly v17, v23, v0.h[4], v0.h[5], v2, v3, v4, v5 // v17 = t4a, v23 = t7a +dmbutterfly v21, v19, v0.h[6], v0.h[7], v2, v3, v4, v5 // v21 = t5a, v19 = t6a butterfly_8hv24, v25, v16, v22 // v24 = t0, v25 = t3 butterfly_8hv28, v29, v17, v21 // v28 = t4, v29 = t5a @@ -361,8 +361,8 @@ itxfm_func4x4 iwht, iwht dmbutterfly0v19, v20, v6, v7, v24, v26, v27, v28, v29, v30 // v19 = -out[3], v20 = out[4] neg v19.8h, v19.8h // v19 = out[3] -dmbutterfly_l v26, v27, v28, v29, v5, v3, v0.h[1], v0.h[2] // v26,v27 = t5a, v28,v29 = t4a -dmbutterfly_l v2, v3, v4, v5, v31, v25, v0.h[2], v0.h[1] // v2,v3 = t6a, v4,v5 = t7a +dmbutterfly_l v26, v27, v28, v29, v5, v3, v0.h[2], v0.h[3] // v26,v27 = t5a, v28,v29 = t4a +dmbutterfly_l v2, v3, v4, v5, v31, v25, v0.h[3], v0.h[2] // v2,v3 = t6a, v4,v5 = t7a dbutterfly_nv17, v30, v28, v29, v2, v3, v6, v7, v24, v25 // v17 = -out[1], v30 = t6 dbutterfly_nv22, v31, v26, v27, v4, v5, v6, v7, v24, v25 // v22 = out[6], v31 = t7 @@ -537,13 +537,13 @@ endfunc function idct16 dmbutterfly0v16, v24, v16, v24, v2, v3, v4, v5, v6, v7 // v16 = t0a, v24 = t1a -dmbutterfly v20, v28, v0.h[1], v0.h[2], v2, v3, v4, v5 // v20 = t2a, v28 = t3a -dmbutterfly v18, v30, v0.h[3], v0.h[4], v2, v3, v4, v5 // v18 = t4a, v30 = t7a -dmbutterfly v26, v22, v0.h[5], v0.h[6], v2, v3, v4, v5 // v26 = t5a, v22 = t6a -dmbutterfly v17, v31, v0.h[7], v1.h[0], v2, v3, v4, v5 // v17 = t8a, v31 = t15a -dmbutterfly v25, v23, v1.h[1], v1.h[2], v2, v3, v4, v5 // v25 = t9a, v23 = t14a -dmbutterfly v21, v27, v1.h[3], v1.h[4], v2, v3, v4, v5 // v21 = t10a, v27 = t13a -dmbutterfly v29, v19, v1.h[5], v1.h[6], v2, v3, v4, v5 // v29 = t11a, v19 = t12a +dmbutterfly v20, v28, v0.h[2], v0.h[3], v2, v3, v4, v5 // v20 = t2a, v28 = t3a +dmbutterfly v18, 
v30, v0.h[4], v0.h[5], v2, v3, v4, v5 // v18 = t4a, v30 = t7a +
[libav-devel] [PATCH 3/4] arm: vp9itxfm: Reorder iadst16 coeffs
This matches the order they are in the 16 bpp version. There they are in this order, to make sure we access them in the same order they are declared, easing loading only half of the coefficients at a time. This makes the 8 bpp version match the 16 bpp version better. --- libavcodec/arm/vp9itxfm_neon.S | 12 ++-- 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/libavcodec/arm/vp9itxfm_neon.S b/libavcodec/arm/vp9itxfm_neon.S index f74d542..c8eeb76 100644 --- a/libavcodec/arm/vp9itxfm_neon.S +++ b/libavcodec/arm/vp9itxfm_neon.S @@ -37,8 +37,8 @@ idct_coeffs: endconst const iadst16_coeffs, align=4 -.short 16364, 804, 15893, 3981, 14811, 7005, 13160, 9760 -.short 11003, 12140, 8423, 14053, 5520, 15426, 2404, 16207 +.short 16364, 804, 15893, 3981, 11003, 12140, 8423, 14053 +.short 14811, 7005, 13160, 9760, 5520, 15426, 2404, 16207 endconst @ Do four 4x4 transposes, using q registers for the subtransposes that don't @@ -672,19 +672,19 @@ function iadst16 vld1.16 {q0-q1}, [r12,:128] mbutterfly_lq3, q2, d31, d16, d0[1], d0[0] @ q3 = t1, q2 = t0 -mbutterfly_lq5, q4, d23, d24, d2[1], d2[0] @ q5 = t9, q4 = t8 +mbutterfly_lq5, q4, d23, d24, d1[1], d1[0] @ q5 = t9, q4 = t8 butterfly_n d31, d24, q3, q5, q6, q5 @ d31 = t1a, d24 = t9a mbutterfly_lq7, q6, d29, d18, d0[3], d0[2] @ q7 = t3, q6 = t2 butterfly_n d16, d23, q2, q4, q3, q4 @ d16 = t0a, d23 = t8a -mbutterfly_lq3, q2, d21, d26, d2[3], d2[2] @ q3 = t11, q2 = t10 +mbutterfly_lq3, q2, d21, d26, d1[3], d1[2] @ q3 = t11, q2 = t10 butterfly_n d29, d26, q7, q3, q4, q3 @ d29 = t3a, d26 = t11a -mbutterfly_lq5, q4, d27, d20, d1[1], d1[0] @ q5 = t5, q4 = t4 +mbutterfly_lq5, q4, d27, d20, d2[1], d2[0] @ q5 = t5, q4 = t4 butterfly_n d18, d21, q6, q2, q3, q2 @ d18 = t2a, d21 = t10a mbutterfly_lq7, q6, d19, d28, d3[1], d3[0] @ q7 = t13, q6 = t12 butterfly_n d20, d28, q5, q7, q2, q7 @ d20 = t5a, d28 = t13a -mbutterfly_lq3, q2, d25, d22, d1[3], d1[2] @ q3 = t7, q2 = t6 +mbutterfly_lq3, q2, d25, d22, d2[3], d2[2] @ q3 = t7, q2 = t6 
butterfly_n d27, d19, q4, q6, q5, q6 @ d27 = t4a, d19 = t12a mbutterfly_lq5, q4, d17, d30, d3[3], d3[2] @ q5 = t15, q4 = t14 -- 2.7.4 ___ libav-devel mailing list libav-devel@libav.org https://lists.libav.org/mailman/listinfo/libav-devel
[libav-devel] [PATCH 1/4] arm: vp9itxfm: Reorder the idct coefficients for better pairing
All elements are used pairwise, except for the first one. Previously, the 16th element was unused. Move the unused element to the second slot, to make the later element pairs not split across registers. This simplifies loading only parts of the coefficients, reducing the difference to the 16 bpp version. --- The 16 bpp version is only in ffmpeg for now, since libav's vp9 decoder doesn't support the high bitdepth profiles. This change in itself still makes sense to do though. --- libavcodec/arm/vp9itxfm_neon.S | 124 - 1 file changed, 62 insertions(+), 62 deletions(-) diff --git a/libavcodec/arm/vp9itxfm_neon.S b/libavcodec/arm/vp9itxfm_neon.S index 167d517..f74d542 100644 --- a/libavcodec/arm/vp9itxfm_neon.S +++ b/libavcodec/arm/vp9itxfm_neon.S @@ -22,7 +22,7 @@ #include "neon.S" const itxfm4_coeffs, align=4 -.short 11585, 6270, 15137, 0 +.short 11585, 0, 6270, 15137 iadst4_coeffs: .short 5283, 15212, 9929, 13377 endconst @@ -30,8 +30,8 @@ endconst const iadst8_coeffs, align=4 .short 16305, 1606, 14449, 7723, 10394, 12665, 4756, 15679 idct_coeffs: -.short 11585, 6270, 15137, 3196, 16069, 13623, 9102, 1606 -.short 16305, 12665, 10394, 7723, 14449, 15679, 4756, 0 +.short 11585, 0, 6270, 15137, 3196, 16069, 13623, 9102 +.short 1606, 16305, 12665, 10394, 7723, 14449, 15679, 4756 .short 804, 16364, 12140, 11003, 7005, 14811, 15426, 5520 .short 3981, 15893, 14053, 8423, 9760, 13160, 16207, 2404 endconst @@ -224,14 +224,14 @@ endconst .endm .macro idct4 c0, c1, c2, c3 -vmull.s16 q13, \c1, d0[2] -vmull.s16 q11, \c1, d0[1] +vmull.s16 q13, \c1, d0[3] +vmull.s16 q11, \c1, d0[2] vadd.i16d16, \c0, \c2 vsub.i16d17, \c0, \c2 -vmlal.s16 q13, \c3, d0[1] +vmlal.s16 q13, \c3, d0[2] vmull.s16 q9, d16, d0[0] vmull.s16 q10, d17, d0[0] -vmlsl.s16 q11, \c3, d0[2] +vmlsl.s16 q11, \c3, d0[3] vrshrn.s32 d26, q13, #14 vrshrn.s32 d18, q9, #14 vrshrn.s32 d20, q10, #14 @@ -350,9 +350,9 @@ itxfm_func4x4 iwht, iwht .macro idct8 dmbutterfly0d16, d17, d24, d25, q8, q12, q2, q4, d4, d5, d8, d9, q3, 
q2, q5, q4 @ q8 = t0a, q12 = t1a -dmbutterfly d20, d21, d28, d29, d0[1], d0[2], q2, q3, q4, q5 @ q10 = t2a, q14 = t3a -dmbutterfly d18, d19, d30, d31, d0[3], d1[0], q2, q3, q4, q5 @ q9 = t4a, q15 = t7a -dmbutterfly d26, d27, d22, d23, d1[1], d1[2], q2, q3, q4, q5 @ q13 = t5a, q11 = t6a +dmbutterfly d20, d21, d28, d29, d0[2], d0[3], q2, q3, q4, q5 @ q10 = t2a, q14 = t3a +dmbutterfly d18, d19, d30, d31, d1[0], d1[1], q2, q3, q4, q5 @ q9 = t4a, q15 = t7a +dmbutterfly d26, d27, d22, d23, d1[2], d1[3], q2, q3, q4, q5 @ q13 = t5a, q11 = t6a butterfly q2, q14, q8, q14 @ q2 = t0, q14 = t3 butterfly q3, q10, q12, q10 @ q3 = t1, q10 = t2 @@ -386,8 +386,8 @@ itxfm_func4x4 iwht, iwht vneg.s16q15, q15 @ q15 = out[7] butterfly q8, q9, q11, q9 @ q8 = out[0], q9 = t2 -dmbutterfly_l q10, q11, q5, q7, d4, d5, d6, d7, d0[1], d0[2] @ q10,q11 = t5a, q5,q7 = t4a -dmbutterfly_l q2, q3, q13, q14, d12, d13, d8, d9, d0[2], d0[1] @ q2,q3 = t6a, q13,q14 = t7a +dmbutterfly_l q10, q11, q5, q7, d4, d5, d6, d7, d0[2], d0[3] @ q10,q11 = t5a, q5,q7 = t4a +dmbutterfly_l q2, q3, q13, q14, d12, d13, d8, d9, d0[3], d0[2] @ q2,q3 = t6a, q13,q14 = t7a dbutterfly_nd28, d29, d8, d9, q10, q11, q13, q14, q4, q6, q10, q11 @ q14 = out[6], q4 = t7 @@ -588,13 +588,13 @@ endfunc function idct16 mbutterfly0 d16, d24, d16, d24, d4, d6, q2, q3 @ d16 = t0a, d24 = t1a -mbutterfly d20, d28, d0[1], d0[2], q2, q3 @ d20 = t2a, d28 = t3a -mbutterfly d18, d30, d0[3], d1[0], q2, q3 @ d18 = t4a, d30 = t7a -mbutterfly d26, d22, d1[1], d1[2], q2, q3 @ d26 = t5a, d22 = t6a -mbutterfly d17, d31, d1[3], d2[0], q2, q3 @ d17 = t8a, d31 = t15a -mbutterfly d25, d23, d2[1], d2[2], q2, q3 @ d25 = t9a, d23 = t14a -mbutterfly d21, d27, d2[3], d3[0], q2, q3 @ d21 = t10a, d27 = t13a -mbutterfly d29, d19, d3[1], d3[2], q2, q3 @ d29 = t11a, d19 = t12a +mbutterfly d20, d28, d0[2], d0[3], q2, q3 @ d20 = t2a, d28 = t3a +mbutterfly d18, d30, d1[0], d1[1], q2, q3 @ d18 = t4a, d30 = t7a +mbutterfly d26, d22, d1[2], d1[3], q2, q3 @ d26 = t5a, d22 = 
t6a +mbutterfly d17, d31, d2[0], d2[1], q2, q3 @ d17 = t8a, d31 = t15a +mbutterfly d25, d23, d2[2],
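The "pairs split across registers" point from the commit message can be made concrete with a little index arithmetic. The sketch below is an assumption-labelled model, not code from libav: `count_split_pairs()` is a hypothetical helper that counts how many consecutive coefficient pairs straddle a register boundary, given the index where the first pair starts and the number of 16-bit lanes per register (4 for an arm `d` register, 8 for an aarch64 `v` register).

```c
#define IDCT_COEFF_COUNT 16

/* Count the pairs (i, i + 1), for i = first_pair_start, first_pair_start + 2,
 * ..., whose two elements land in different registers of `lanes` 16-bit
 * lanes each. */
static int count_split_pairs(int first_pair_start, int lanes)
{
    int split = 0;

    for (int i = first_pair_start; i + 1 < IDCT_COEFF_COUNT; i += 2)
        if (i / lanes != (i + 1) / lanes)
            split++;
    return split;
}
```

In the old layout the pairs start at index 1 (the unused element sits at index 15), so with 4 lanes per register the pairs (3,4), (7,8) and (11,12) are split, as in the removed `d0[3], d1[0]` operands. Moving the unused element to index 1 makes the pairs start at index 2, and no pair is split for either register width.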
[libav-devel] [PATCH 3/6] aarch64: vp9itxfm: Use a single lane ld1 instead of ld1r where possible
The ld1r is a leftover from the arm version, where this trick is beneficial on some cores. Use a single-lane load where we don't need the semantics of ld1r. --- libavcodec/aarch64/vp9itxfm_neon.S | 16 1 file changed, 8 insertions(+), 8 deletions(-) diff --git a/libavcodec/aarch64/vp9itxfm_neon.S b/libavcodec/aarch64/vp9itxfm_neon.S index a9c7626..e7b8836 100644 --- a/libavcodec/aarch64/vp9itxfm_neon.S +++ b/libavcodec/aarch64/vp9itxfm_neon.S @@ -255,7 +255,7 @@ function ff_vp9_\txfm1\()_\txfm2\()_4x4_add_neon, export=1 cmp w3, #1 b.ne1f // DC-only for idct/idct -ld1r{v2.4h}, [x2] +ld1 {v2.h}[0], [x2] smull v2.4s, v2.4h, v0.h[0] rshrn v2.4h, v2.4s, #14 smull v2.4s, v2.4h, v0.h[0] @@ -287,8 +287,8 @@ function ff_vp9_\txfm1\()_\txfm2\()_4x4_add_neon, export=1 \txfm2\()4 v4, v5, v6, v7 2: -ld1r{v0.2s}, [x0], x1 -ld1r{v1.2s}, [x0], x1 +ld1 {v0.s}[0], [x0], x1 +ld1 {v1.s}[0], [x0], x1 .ifnc \txfm1,iwht srshr v4.4h, v4.4h, #4 srshr v5.4h, v5.4h, #4 @@ -297,8 +297,8 @@ function ff_vp9_\txfm1\()_\txfm2\()_4x4_add_neon, export=1 .endif uaddw v4.8h, v4.8h, v0.8b uaddw v5.8h, v5.8h, v1.8b -ld1r{v2.2s}, [x0], x1 -ld1r{v3.2s}, [x0], x1 +ld1 {v2.s}[0], [x0], x1 +ld1 {v3.s}[0], [x0], x1 sqxtun v0.8b, v4.8h sqxtun v1.8b, v5.8h sub x0, x0, x1, lsl #2 @@ -394,7 +394,7 @@ function ff_vp9_\txfm1\()_\txfm2\()_8x8_add_neon, export=1 cmp w3, #1 b.ne1f // DC-only for idct/idct -ld1r{v2.4h}, [x2] +ld1 {v2.h}[0], [x2] smull v2.4s, v2.4h, v0.h[0] rshrn v2.4h, v2.4s, #14 smull v2.4s, v2.4h, v0.h[0] @@ -485,7 +485,7 @@ function idct16x16_dc_add_neon moviv1.4h, #0 -ld1r{v2.4h}, [x2] +ld1 {v2.h}[0], [x2] smull v2.4s, v2.4h, v0.h[0] rshrn v2.4h, v2.4s, #14 smull v2.4s, v2.4h, v0.h[0] @@ -1044,7 +1044,7 @@ function idct32x32_dc_add_neon moviv1.4h, #0 -ld1r{v2.4h}, [x2] +ld1 {v2.h}[0], [x2] smull v2.4s, v2.4h, v0.h[0] rshrn v2.4h, v2.4s, #14 smull v2.4s, v2.4h, v0.h[0] -- 2.7.4 ___ libav-devel mailing list libav-devel@libav.org https://lists.libav.org/mailman/listinfo/libav-devel
[libav-devel] [PATCH 4/6] aarch64: vp9itxfm: Use the right lane sizes in 8x8 for improved readability
--- libavcodec/aarch64/vp9itxfm_neon.S | 16 1 file changed, 8 insertions(+), 8 deletions(-) diff --git a/libavcodec/aarch64/vp9itxfm_neon.S b/libavcodec/aarch64/vp9itxfm_neon.S index e7b8836..7582081 100644 --- a/libavcodec/aarch64/vp9itxfm_neon.S +++ b/libavcodec/aarch64/vp9itxfm_neon.S @@ -385,10 +385,10 @@ function ff_vp9_\txfm1\()_\txfm2\()_8x8_add_neon, export=1 .endif ld1 {v0.8h}, [x4] -moviv2.16b, #0 -moviv3.16b, #0 -moviv4.16b, #0 -moviv5.16b, #0 +moviv2.8h, #0 +moviv3.8h, #0 +moviv4.8h, #0 +moviv5.8h, #0 .ifc \txfm1\()_\txfm2,idct_idct cmp w3, #1 @@ -411,11 +411,11 @@ function ff_vp9_\txfm1\()_\txfm2\()_8x8_add_neon, export=1 b 2f .endif 1: -ld1 {v16.16b,v17.16b,v18.16b,v19.16b}, [x2], #64 -ld1 {v20.16b,v21.16b,v22.16b,v23.16b}, [x2], #64 +ld1 {v16.8h,v17.8h,v18.8h,v19.8h}, [x2], #64 +ld1 {v20.8h,v21.8h,v22.8h,v23.8h}, [x2], #64 sub x2, x2, #128 -st1 {v2.16b,v3.16b,v4.16b,v5.16b}, [x2], #64 -st1 {v2.16b,v3.16b,v4.16b,v5.16b}, [x2], #64 +st1 {v2.8h,v3.8h,v4.8h,v5.8h}, [x2], #64 +st1 {v2.8h,v3.8h,v4.8h,v5.8h}, [x2], #64 \txfm1\()8 -- 2.7.4 ___ libav-devel mailing list libav-devel@libav.org https://lists.libav.org/mailman/listinfo/libav-devel
[libav-devel] [PATCH 5/6] aarch64: vp9itxfm: Update a comment to refer to a register with a different name
---
 libavcodec/aarch64/vp9itxfm_neon.S | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/libavcodec/aarch64/vp9itxfm_neon.S b/libavcodec/aarch64/vp9itxfm_neon.S
index 7582081..8102720 100644
--- a/libavcodec/aarch64/vp9itxfm_neon.S
+++ b/libavcodec/aarch64/vp9itxfm_neon.S
@@ -41,8 +41,8 @@ const iadst16_coeffs, align=4
         .short  11003, 12140, 8423, 14053, 5520, 15426, 2404, 16207
 endconst

-// out1 = ((in1 + in2) * d0[0] + (1 << 13)) >> 14
-// out2 = ((in1 - in2) * d0[0] + (1 << 13)) >> 14
+// out1 = ((in1 + in2) * v0[0] + (1 << 13)) >> 14
+// out2 = ((in1 - in2) * v0[0] + (1 << 13)) >> 14
 // in/out are .8h registers; this can do with 4 temp registers, but is
 // more efficient if 6 temp registers are available.
 .macro dmbutterfly0 out1, out2, in1, in2, tmp1, tmp2, tmp3, tmp4, tmp5, tmp6, neg=0
--
2.7.4
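As a reading aid for the comment touched by this patch, here is a scalar C model of the formula it documents, applied to a single lane. It is a sketch under stated assumptions: `dmbutterfly0_lane()` is a hypothetical name, the coefficient is passed in explicitly (11585, the first entry of `idct_coeffs`, is cos(pi/4) in Q14 fixed point), and the code relies on `>>` being an arithmetic shift for negative values, which is implementation-defined in C but holds on common compilers.

```c
#include <stdint.h>

/* Scalar model of one lane of the formula in the comment:
 *   out1 = ((in1 + in2) * c + (1 << 13)) >> 14
 *   out2 = ((in1 - in2) * c + (1 << 13)) >> 14
 * c is a Q14 fixed-point coefficient; adding 1 << 13 rounds to nearest
 * before the shift drops the 14 fractional bits. */
static void dmbutterfly0_lane(int16_t in1, int16_t in2, int16_t c,
                              int16_t *out1, int16_t *out2)
{
    int32_t sum  = (int32_t)in1 + in2;
    int32_t diff = (int32_t)in1 - in2;

    *out1 = (int16_t)((sum  * c + (1 << 13)) >> 14);
    *out2 = (int16_t)((diff * c + (1 << 13)) >> 14);
}
```

For example, with `in1 = 100`, `in2 = 50` and `c = 11585`, the sum 150 scales to 106 and the difference 50 scales to 35, close to 150 and 50 multiplied by 0.7071.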
[libav-devel] [PATCH 6/6] aarch64: vp9itxfm: Fix incorrect vertical alignment
--- libavcodec/aarch64/vp9itxfm_neon.S | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/libavcodec/aarch64/vp9itxfm_neon.S b/libavcodec/aarch64/vp9itxfm_neon.S index 8102720..a199e9c 100644 --- a/libavcodec/aarch64/vp9itxfm_neon.S +++ b/libavcodec/aarch64/vp9itxfm_neon.S @@ -225,7 +225,7 @@ endconst add v21.4s,v17.4s,v19.4s rshrn \c0\().4h, v20.4s,#14 add v16.4s,v16.4s,v17.4s -rshrn \c1\().4h, v21.4s, #14 +rshrn \c1\().4h, v21.4s,#14 sub v16.4s,v16.4s,v19.4s rshrn \c2\().4h, v18.4s,#14 rshrn \c3\().4h, v16.4s,#14 @@ -1313,8 +1313,8 @@ function idct32_1d_8x32_pass1\suffix\()_neon bl idct32_odd\suffix -transpose_8x8H v31, v30, v29, v28, v27, v26, v25, v24, v2, v3 -transpose_8x8H v23, v22, v21, v20, v19, v18, v17, v16, v2, v3 +transpose_8x8H v31, v30, v29, v28, v27, v26, v25, v24, v2, v3 +transpose_8x8H v23, v22, v21, v20, v19, v18, v17, v16, v2, v3 // Store the registers a, b horizontally, // adding into the output first, and the mirrored, -- 2.7.4 ___ libav-devel mailing list libav-devel@libav.org https://lists.libav.org/mailman/listinfo/libav-devel
[libav-devel] [PATCH 2/6] aarch64: vp9itxfm: Share instructions for loading idct coeffs in the 8x8 function
---
 libavcodec/aarch64/vp9itxfm_neon.S | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/libavcodec/aarch64/vp9itxfm_neon.S b/libavcodec/aarch64/vp9itxfm_neon.S
index c954d1a..a9c7626 100644
--- a/libavcodec/aarch64/vp9itxfm_neon.S
+++ b/libavcodec/aarch64/vp9itxfm_neon.S
@@ -379,12 +379,11 @@ function ff_vp9_\txfm1\()_\txfm2\()_8x8_add_neon, export=1
         // idct, so those always need to be loaded.
 .ifc \txfm1\()_\txfm2,idct_idct
         movrel x4, idct_coeffs
-        ld1 {v0.8h}, [x4]
 .else
         movrel x4, iadst8_coeffs
         ld1 {v1.8h}, [x4], #16
-        ld1 {v0.8h}, [x4]
 .endif
+        ld1 {v0.8h}, [x4]
 
         movi v2.16b, #0
         movi v3.16b, #0
-- 
2.7.4
[libav-devel] [PATCH 1/6] arm: vp9itxfm: Share instructions for loading idct coeffs in the 8x8 function
---
 libavcodec/arm/vp9itxfm_neon.S | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/libavcodec/arm/vp9itxfm_neon.S b/libavcodec/arm/vp9itxfm_neon.S
index 167d517..3d0b0fa 100644
--- a/libavcodec/arm/vp9itxfm_neon.S
+++ b/libavcodec/arm/vp9itxfm_neon.S
@@ -412,13 +412,12 @@ function ff_vp9_\txfm1\()_\txfm2\()_8x8_add_neon, export=1
 .ifc \txfm1\()_\txfm2,idct_idct
         movrel r12, idct_coeffs
         vpush {q4-q5}
-        vld1.16 {q0}, [r12,:128]
 .else
         movrel r12, iadst8_coeffs
         vld1.16 {q1}, [r12,:128]!
         vpush {q4-q7}
-        vld1.16 {q0}, [r12,:128]
 .endif
+        vld1.16 {q0}, [r12,:128]
 
         vmov.i16 q2, #0
         vmov.i16 q3, #0
-- 
2.7.4
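The restructuring in this patch can be pictured in C as hoisting a load that both branches end with. The sketch below is only an analogy: coeff_table, coeff_regs and load_coeffs are invented names, the numbers are placeholders, and the layout (iadst8 coefficients directly followed by the idct ones) is an assumption inferred from the post-incrementing load in the diff.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical coefficient table: 8 iadst8 values directly followed by
 * 8 idct values. The numbers are placeholders, not the real constants. */
static const int16_t coeff_table[16] = {
    1, 2, 3, 4, 5, 6, 7, 8,          /* stands in for iadst8_coeffs */
    11, 12, 13, 14, 15, 16, 17, 18   /* stands in for idct_coeffs   */
};

/* Models the two vector registers that the 8x8 function fills. */
typedef struct {
    int16_t v0[8];   /* idct coefficients, always needed  */
    int16_t v1[8];   /* iadst8 coefficients, only if used */
} coeff_regs;

static void load_coeffs(coeff_regs *regs, int is_idct_idct)
{
    const int16_t *p;

    if (is_idct_idct) {
        p = coeff_table + 8;                    /* point at the idct part */
    } else {
        p = coeff_table;                        /* point at the iadst8 part */
        memcpy(regs->v1, p, sizeof(regs->v1));  /* load v1 ...              */
        p += 8;                                 /* ... and post-increment   */
    }
    /* Both branches leave p at the idct coefficients, so the final load
     * is written once after the branch instead of once per branch. */
    memcpy(regs->v0, p, sizeof(regs->v0));
}
```

Either path ends with the same contents in v0, which is why the load can be shared without changing behaviour.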
[libav-devel] [PATCH] arm: vp9itxfm: Avoid reloading the idct32 coefficients
The idct32x32 function actually backed up and restored q4-q7
even though it didn't clobber them; there are plenty of
registers that can be used to allow keeping all the idct coefficients
in registers without having to reload different subsets of them
at different stages in the transform.

Since the idct16 core transform avoids clobbering q4-q7 (but clobbers
q2-q3 instead, to avoid needing to back up and restore q4-q7 at all
in the idct16 function), and the lanewise vmul needs a register
in the q0-q3 range, we move the stored coefficients from q2-q3 into
q4-q5 while doing idct16.

While keeping these coefficients in registers, we still can skip backing
up and restoring q7.

Before:                              Cortex A7       A8       A9      A53
vp9_inv_dct_dct_32x32_sub32_add_neon:  18553.8  17182.7  14303.3  12089.7
After:
vp9_inv_dct_dct_32x32_sub32_add_neon:  18470.3  16717.7  14173.6  11860.8
---
 libavcodec/arm/vp9itxfm_neon.S | 246 -
 1 file changed, 120 insertions(+), 126 deletions(-)

diff --git a/libavcodec/arm/vp9itxfm_neon.S b/libavcodec/arm/vp9itxfm_neon.S
index 167d517..df3f923 100644
--- a/libavcodec/arm/vp9itxfm_neon.S
+++ b/libavcodec/arm/vp9itxfm_neon.S
@@ -1168,58 +1168,51 @@ function idct32x32_dc_add_neon
 endfunc
 
 .macro idct32_end
-        butterfly d16, d5, d4, d5 @ d16 = t16a, d5 = t19a
+        butterfly d16, d9, d8, d9 @ d16 = t16a, d9 = t19a
         butterfly d17, d20, d23, d20 @ d17 = t17, d20 = t18
-        butterfly d18, d6, d7, d6 @ d18 = t23a, d6 = t20a
+        butterfly d18, d10, d11, d10 @ d18 = t23a, d10 = t20a
         butterfly d19, d21, d22, d21 @ d19 = t22, d21 = t21
-        butterfly d4, d28, d28, d30 @ d4 = t24a, d28 = t27a
+        butterfly d8, d28, d28, d30 @ d8 = t24a, d28 = t27a
         butterfly d23, d26, d25, d26 @ d23 = t25, d26 = t26
-        butterfly d7, d29, d29, d31 @ d7 = t31a, d29 = t28a
+        butterfly d11, d29, d29, d31 @ d11 = t31a, d29 = t28a
         butterfly d22, d27, d24, d27 @ d22 = t30, d27 = t29
 
         mbutterfly d27, d20, d0[1], d0[2], q12, q15 @ d27 = t18a, d20 = t29a
-        mbutterfly d29, d5, d0[1], d0[2], q12, q15 @ d29 = t19, d5 = t28
-        mbutterfly d28, d6, d0[1], d0[2], q12, q15, neg=1 @ d28 = t27, d6 = t20
+        mbutterfly d29, d9, d0[1], d0[2], q12, q15 @ d29 = t19, d9 = t28
+        mbutterfly d28, d10, d0[1], d0[2], q12, q15, neg=1 @ d28 = t27, d10 = t20
         mbutterfly d26, d21, d0[1], d0[2], q12, q15, neg=1 @ d26 = t26a, d21 = t21a
 
-        butterfly d31, d24, d7, d4 @ d31 = t31, d24 = t24
+        butterfly d31, d24, d11, d8 @ d31 = t31, d24 = t24
         butterfly d30, d25, d22, d23 @ d30 = t30a, d25 = t25a
         butterfly_r d23, d16, d16, d18 @ d23 = t23, d16 = t16
         butterfly_r d22, d17, d17, d19 @ d22 = t22a, d17 = t17a
         butterfly d18, d21, d27, d21 @ d18 = t18, d21 = t21
-        butterfly_r d27, d28, d5, d28 @ d27 = t27a, d28 = t28a
-        butterfly d4, d26, d20, d26 @ d4 = t29, d26 = t26
-        butterfly d19, d20, d29, d6 @ d19 = t19a, d20 = t20
-        vmov d29, d4 @ d29 = t29
-
-        mbutterfly0 d27, d20, d27, d20, d4, d6, q2, q3 @ d27 = t27, d20 = t20
-        mbutterfly0 d26, d21, d26, d21, d4, d6, q2, q3 @ d26 = t26a, d21 = t21a
-        mbutterfly0 d25, d22, d25, d22, d4, d6, q2, q3 @ d25 = t25, d22 = t22
-        mbutterfly0 d24, d23, d24, d23, d4, d6, q2, q3 @ d24 = t24a, d23 = t23a
+        butterfly_r d27, d28, d9, d28 @ d27 = t27a, d28 = t28a
+        butterfly d8, d26, d20, d26 @ d8 = t29, d26 = t26
+        butterfly d19, d20, d29, d10 @ d19 = t19a, d20 = t20
+        vmov d29, d8 @ d29 = t29
+
+        mbutterfly0 d27, d20, d27, d20, d8, d10, q4, q5 @ d27 = t27, d20 = t20
+        mbutterfly0 d26, d21, d26, d21, d8, d10, q4, q5 @ d26 = t26a, d21 = t21a
+        mbutterfly0 d25, d22, d25, d22, d8, d10, q4, q5 @ d25 = t25, d22 = t22
+        mbutterfly0 d24, d23, d24, d23, d8, d10, q4, q5 @ d24 = t24a, d23 = t23a
         bx lr
 .endm
 
 function idct32_odd
-        movrel r12, idct_coeffs
-        add r12, r12, #32
-        vld1.16 {q0-q1}, [r12,:128]
-
-        mbutterfly d16, d31, d0[0], d0[1], q2, q3 @ d16 = t16a, d31 = t31a
-        mbutterfly d24, d23, d0[2], d0[3], q2, q3 @ d24 = t17a, d23 = t30a
-        mbutterfly d20, d27, d1[0], d1[1], q2, q3 @ d20 = t18a, d27 = t29a
-        mbutterfly d28, d19, d1[2], d1[3], q2, q3 @ d28 = t19a, d19 = t28a
-        mbutterfly d18, d29, d2[0], d2[1], q2, q3 @ d18 = t20a, d29 = t27a
-        mbutterfly d26, d21, d2[2], d2[3], q2, q3 @ d26 = t21a, d21 = t26a
-
[libav-devel] [PATCH] aarch64: vp9itxfm: Avoid reloading the idct32 coefficients
The idct32x32 function actually backed up and restored d8-d15
even though it didn't clobber them; there are plenty of
registers that can be used to allow keeping all the idct coefficients
in registers without having to reload different subsets of them
at different stages in the transform.

After this, we still can skip backing up and restoring d12-d15.

Before:
vp9_inv_dct_dct_32x32_sub32_add_neon: 8128.3
After:
vp9_inv_dct_dct_32x32_sub32_add_neon: 8053.3
---
 libavcodec/aarch64/vp9itxfm_neon.S | 110 +++--
 1 file changed, 43 insertions(+), 67 deletions(-)

diff --git a/libavcodec/aarch64/vp9itxfm_neon.S b/libavcodec/aarch64/vp9itxfm_neon.S
index c954d1a..64286df 100644
--- a/libavcodec/aarch64/vp9itxfm_neon.S
+++ b/libavcodec/aarch64/vp9itxfm_neon.S
@@ -1106,18 +1106,14 @@ endfunc
 .endm
 
 function idct32_odd
-        ld1 {v0.8h,v1.8h}, [x11]
-
-        dmbutterfly v16, v31, v0.h[0], v0.h[1], v4, v5, v6, v7 // v16 = t16a, v31 = t31a
-        dmbutterfly v24, v23, v0.h[2], v0.h[3], v4, v5, v6, v7 // v24 = t17a, v23 = t30a
-        dmbutterfly v20, v27, v0.h[4], v0.h[5], v4, v5, v6, v7 // v20 = t18a, v27 = t29a
-        dmbutterfly v28, v19, v0.h[6], v0.h[7], v4, v5, v6, v7 // v28 = t19a, v19 = t28a
-        dmbutterfly v18, v29, v1.h[0], v1.h[1], v4, v5, v6, v7 // v18 = t20a, v29 = t27a
-        dmbutterfly v26, v21, v1.h[2], v1.h[3], v4, v5, v6, v7 // v26 = t21a, v21 = t26a
-        dmbutterfly v22, v25, v1.h[4], v1.h[5], v4, v5, v6, v7 // v22 = t22a, v25 = t25a
-        dmbutterfly v30, v17, v1.h[6], v1.h[7], v4, v5, v6, v7 // v30 = t23a, v17 = t24a
-
-        ld1 {v0.8h}, [x10]
+        dmbutterfly v16, v31, v8.h[0], v8.h[1], v4, v5, v6, v7 // v16 = t16a, v31 = t31a
+        dmbutterfly v24, v23, v8.h[2], v8.h[3], v4, v5, v6, v7 // v24 = t17a, v23 = t30a
+        dmbutterfly v20, v27, v8.h[4], v8.h[5], v4, v5, v6, v7 // v20 = t18a, v27 = t29a
+        dmbutterfly v28, v19, v8.h[6], v8.h[7], v4, v5, v6, v7 // v28 = t19a, v19 = t28a
+        dmbutterfly v18, v29, v9.h[0], v9.h[1], v4, v5, v6, v7 // v18 = t20a, v29 = t27a
+        dmbutterfly v26, v21, v9.h[2], v9.h[3], v4, v5, v6, v7 // v26 = t21a, v21 = t26a
+        dmbutterfly v22, v25, v9.h[4], v9.h[5], v4, v5, v6, v7 // v22 = t22a, v25 = t25a
+        dmbutterfly v30, v17, v9.h[6], v9.h[7], v4, v5, v6, v7 // v30 = t23a, v17 = t24a
 
         butterfly_8h v4, v24, v16, v24 // v4 = t16, v24 = t17
         butterfly_8h v5, v20, v28, v20 // v5 = t19, v20 = t18
@@ -1136,18 +1132,14 @@ function idct32_odd
 endfunc
 
 function idct32_odd_half
-        ld1 {v0.8h,v1.8h}, [x11]
-
-        dmbutterfly_h1 v16, v31, v0.h[0], v0.h[1], v4, v5, v6, v7 // v16 = t16a, v31 = t31a
-        dmbutterfly_h2 v24, v23, v0.h[2], v0.h[3], v4, v5, v6, v7 // v24 = t17a, v23 = t30a
-        dmbutterfly_h1 v20, v27, v0.h[4], v0.h[5], v4, v5, v6, v7 // v20 = t18a, v27 = t29a
-        dmbutterfly_h2 v28, v19, v0.h[6], v0.h[7], v4, v5, v6, v7 // v28 = t19a, v19 = t28a
-        dmbutterfly_h1 v18, v29, v1.h[0], v1.h[1], v4, v5, v6, v7 // v18 = t20a, v29 = t27a
-        dmbutterfly_h2 v26, v21, v1.h[2], v1.h[3], v4, v5, v6, v7 // v26 = t21a, v21 = t26a
-        dmbutterfly_h1 v22, v25, v1.h[4], v1.h[5], v4, v5, v6, v7 // v22 = t22a, v25 = t25a
-        dmbutterfly_h2 v30, v17, v1.h[6], v1.h[7], v4, v5, v6, v7 // v30 = t23a, v17 = t24a
-
-        ld1 {v0.8h}, [x10]
+        dmbutterfly_h1 v16, v31, v8.h[0], v8.h[1], v4, v5, v6, v7 // v16 = t16a, v31 = t31a
+        dmbutterfly_h2 v24, v23, v8.h[2], v8.h[3], v4, v5, v6, v7 // v24 = t17a, v23 = t30a
+        dmbutterfly_h1 v20, v27, v8.h[4], v8.h[5], v4, v5, v6, v7 // v20 = t18a, v27 = t29a
+        dmbutterfly_h2 v28, v19, v8.h[6], v8.h[7], v4, v5, v6, v7 // v28 = t19a, v19 = t28a
+        dmbutterfly_h1 v18, v29, v9.h[0], v9.h[1], v4, v5, v6, v7 // v18 = t20a, v29 = t27a
+        dmbutterfly_h2 v26, v21, v9.h[2], v9.h[3], v4, v5, v6, v7 // v26 = t21a, v21 = t26a
+        dmbutterfly_h1 v22, v25, v9.h[4], v9.h[5], v4, v5, v6, v7 // v22 = t22a, v25 = t25a
+        dmbutterfly_h2 v30, v17, v9.h[6], v9.h[7], v4, v5, v6, v7 // v30 = t23a, v17 = t24a
 
         butterfly_8h v4, v24, v16, v24 // v4 = t16, v24 = t17
         butterfly_8h v5, v20, v28, v20 // v5 = t19, v20 = t18
@@ -1166,18 +1158,14 @@ function idct32_odd_half
 endfunc
 
 function idct32_odd_quarter
-        ld1 {v0.8h,v1.8h}, [x11]
-
-        dsmull_h v4, v5, v16, v0.h[0]
-        dsmull_h v28, v29, v19, v0.h[7]
-        dsmull_h v30, v31, v16, v0.h[1]
-        dsmull_h v22, v23, v17, v1.h[6]
-        dsmull_h v7, v6, v17, v1.h[7]
-        dsmull_h v26, v27, v19, v0.h[6]
-        dsmull_h v20, v21, v18, v1.h[0]
-        dsmull_h v24, v25, v18, v1.h[1]
-
-        ld1 {v0.8h}, [x10]
+
Re: [libav-devel] [PATCH] hlsenc: Correctly write down all 16 bytes in hex
On 08/02/2017 13:42, Luca Barbato wrote:
> ---
>
>  libavformat/hlsenc.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/libavformat/hlsenc.c b/libavformat/hlsenc.c
> index 05c9adb..3496bdd 100644
> --- a/libavformat/hlsenc.c
> +++ b/libavformat/hlsenc.c
> @@ -106,7 +106,7 @@ static int dict_set_bin(AVDictionary **dict, const char *key, uint8_t *buf)
>  {
>      char hex[33];
>
> -    ff_data_to_hex(hex, buf, sizeof(buf), 0);
> +    ff_data_to_hex(hex, buf, 16, 0);
>      hex[32] = '\0';
>
>      return av_dict_set(dict, key, hex, 0);
> --
> 2.9.2
>

Ping.
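For context on why the one-line change matters: buf is a pointer parameter, so sizeof(buf) is the size of a pointer (commonly 4 or 8 bytes), not the 16 bytes the data holds. The sketch below shows the corrected pattern in isolation. data_to_hex and key_to_hex are hypothetical stand-ins written for this example, not libav's ff_data_to_hex(), and KEY_SIZE is an illustrative name for the constant 16 used in the patch.

```c
#include <stddef.h>
#include <stdint.h>

#define KEY_SIZE 16   /* number of bytes to encode, as in the patch */

/* Hypothetical stand-in for a hex encoder: writes 2 * len uppercase hex
 * digits to out and does not add a terminator. */
static void data_to_hex(char *out, const uint8_t *src, size_t len)
{
    static const char digits[] = "0123456789ABCDEF";
    size_t i;

    for (i = 0; i < len; i++) {
        out[2 * i]     = digits[src[i] >> 4];
        out[2 * i + 1] = digits[src[i] & 0x0f];
    }
}

/* buf is a pointer here, so sizeof(buf) would yield the pointer size and
 * only part of the data would be encoded. Passing the length explicitly
 * encodes all KEY_SIZE bytes. hex must have room for 2 * KEY_SIZE + 1
 * characters. */
static void key_to_hex(char *hex, const uint8_t *buf)
{
    data_to_hex(hex, buf, KEY_SIZE);
    hex[2 * KEY_SIZE] = '\0';
}
```

On a typical 64-bit build the sizeof version would fill only the first 16 hex digits of the 32-character buffer, leaving the rest uninitialized before the terminator.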
[libav-devel] [PATCH] hwcontext_dxva2: support D3D9Ex
D3D9Ex uses different driver paths. This helps with "headless"
configurations when no user logs in. Plain D3D9 device creation will
fail if no user is logged in, while it works with D3D9Ex.
---
 libavutil/hwcontext_dxva2.c | 117
 1 file changed, 87 insertions(+), 30 deletions(-)

diff --git a/libavutil/hwcontext_dxva2.c b/libavutil/hwcontext_dxva2.c
index ccf03c8e9f..3790bed4b7 100644
--- a/libavutil/hwcontext_dxva2.c
+++ b/libavutil/hwcontext_dxva2.c
@@ -38,8 +38,22 @@
 #include "pixfmt.h"
 
 typedef IDirect3D9* WINAPI pDirect3DCreate9(UINT);
+typedef HRESULT WINAPI pDirect3DCreate9Ex(UINT, IDirect3D9Ex **);
 typedef HRESULT WINAPI pCreateDeviceManager9(UINT *, IDirect3DDeviceManager9 **);
 
+#define FF_D3DCREATE_FLAGS (D3DCREATE_SOFTWARE_VERTEXPROCESSING | \
+                            D3DCREATE_MULTITHREADED | \
+                            D3DCREATE_FPU_PRESERVE)
+
+static const D3DPRESENT_PARAMETERS dxva2_present_params = {
+    .Windowed = TRUE,
+    .BackBufferWidth = 640,
+    .BackBufferHeight = 480,
+    .BackBufferCount = 0,
+    .SwapEffect = D3DSWAPEFFECT_DISCARD,
+    .Flags = D3DPRESENTFLAG_VIDEO,
+};
+
 typedef struct DXVA2Mapping {
     uint32_t palette_dummy[256];
 } DXVA2Mapping;
@@ -411,19 +425,83 @@ static void dxva2_device_free(AVHWDeviceContext *ctx)
     av_freep(&ctx->user_opaque);
 }
 
+static int dxva2_device_create9(AVHWDeviceContext *ctx, UINT adapter)
+{
+    DXVA2DevicePriv *priv = ctx->user_opaque;
+    D3DPRESENT_PARAMETERS d3dpp = dxva2_present_params;
+    D3DDISPLAYMODE d3ddm;
+    HRESULT hr;
+    pDirect3DCreate9 *createD3D = (pDirect3DCreate9 *)GetProcAddress(priv->d3dlib, "Direct3DCreate9");
+    if (!createD3D) {
+        av_log(ctx, AV_LOG_ERROR, "Failed to locate Direct3DCreate9\n");
+        return AVERROR_UNKNOWN;
+    }
+
+    priv->d3d9 = createD3D(D3D_SDK_VERSION);
+    if (!priv->d3d9) {
+        av_log(ctx, AV_LOG_ERROR, "Failed to create IDirect3D object\n");
+        return AVERROR_UNKNOWN;
+    }
+
+    IDirect3D9_GetAdapterDisplayMode(priv->d3d9, adapter, &d3ddm);
+
+    d3dpp.BackBufferFormat = d3ddm.Format;
+
+    hr = IDirect3D9_CreateDevice(priv->d3d9, adapter, D3DDEVTYPE_HAL, GetShellWindow(),
+                                 FF_D3DCREATE_FLAGS,
+                                 &d3dpp, &priv->d3d9device);
+    if (FAILED(hr)) {
+        av_log(ctx, AV_LOG_ERROR, "Failed to create Direct3D device\n");
+        return AVERROR_UNKNOWN;
+    }
+
+    return 0;
+}
+
+static int dxva2_device_create9ex(AVHWDeviceContext *ctx, UINT adapter)
+{
+    DXVA2DevicePriv *priv = ctx->user_opaque;
+    D3DPRESENT_PARAMETERS d3dpp = dxva2_present_params;
+    D3DDISPLAYMODEEX modeex = {0};
+    IDirect3D9Ex *d3d9ex = NULL;
+    IDirect3DDevice9Ex *exdev = NULL;
+    HRESULT hr;
+    pDirect3DCreate9Ex *createD3DEx = (pDirect3DCreate9Ex *)GetProcAddress(priv->d3dlib, "Direct3DCreate9Ex");
+    if (!createD3DEx)
+        return AVERROR(ENOSYS);
+
+    hr = createD3DEx(D3D_SDK_VERSION, &d3d9ex);
+    if (FAILED(hr))
+        return AVERROR_UNKNOWN;
+
+    IDirect3D9Ex_GetAdapterDisplayModeEx(d3d9ex, adapter, &modeex, NULL);
+
+    d3dpp.BackBufferFormat = modeex.Format;
+
+    hr = IDirect3D9Ex_CreateDeviceEx(d3d9ex, adapter, D3DDEVTYPE_HAL, GetShellWindow(),
+                                     FF_D3DCREATE_FLAGS,
+                                     &d3dpp, NULL, &exdev);
+    if (FAILED(hr)) {
+        IDirect3D9Ex_Release(d3d9ex);
+        return AVERROR_UNKNOWN;
+    }
+
+    av_log(ctx, AV_LOG_VERBOSE, "Using D3D9Ex device.\n");
+    priv->d3d9 = (IDirect3D9 *)d3d9ex;
+    priv->d3d9device = (IDirect3DDevice9 *)exdev;
+    return 0;
+}
+
 static int dxva2_device_create(AVHWDeviceContext *ctx, const char *device,
                                AVDictionary *opts, int flags)
 {
     AVDXVA2DeviceContext *hwctx = ctx->hwctx;
     DXVA2DevicePriv *priv;
-
-    pDirect3DCreate9 *createD3D = NULL;
     pCreateDeviceManager9 *createDeviceManager = NULL;
-    D3DPRESENT_PARAMETERS d3dpp = {0};
-    D3DDISPLAYMODE d3ddm;
     unsigned resetToken = 0;
     UINT adapter = D3DADAPTER_DEFAULT;
     HRESULT hr;
+    int err;
 
     if (device)
         adapter = atoi(device);
@@ -448,11 +526,6 @@ static int dxva2_device_create(AVHWDeviceContext *ctx, const char *device,
         return AVERROR_UNKNOWN;
     }
 
-    createD3D = (pDirect3DCreate9 *)GetProcAddress(priv->d3dlib, "Direct3DCreate9");
-    if (!createD3D) {
-        av_log(ctx, AV_LOG_ERROR, "Failed to locate Direct3DCreate9\n");
-        return AVERROR_UNKNOWN;
-    }
     createDeviceManager = (pCreateDeviceManager9 *)GetProcAddress(priv->dxva2lib,
                                                                   "DXVA2CreateDirect3DDeviceManager9");
     if (!createDeviceManager) {
@@ -460,27 +533,11 @@ static int dxva2_device_create(AVHWDeviceContext *ctx, const char *device,
         return AVERROR_UNKNOWN;
     }
 
-    priv->d3d9 = createD3D(D3D_SDK_VERSION);
-
Re: [libav-devel] [PATCH 5/5] aarch64: vp9itxfm: Do separate functions for half/quarter idct16 and idct32 (alternative 2)
On Thu, 9 Feb 2017, Janne Grunau wrote:

> On 2017-02-06 00:16:41 +0200, Martin Storsjö wrote:
>> Ok, so after running a slightly shorter clip (which seems to have about
>> as large percentage of runtime doing IDCT as the previous one) with a
>> bit more iterations, I've got the following results (the 'user' part
>> from 'time avconv -threads 1 -i foo -f null -'):
>>
>> 32 orig  32 alt1  32 alt2  64 orig  64 alt1  64 alt2
>> 40.436s  40.148s  40.008s  37.428s  37.356s  37.192s
>> 40.596s  40.140s  40.216s  37.572s  37.524s  37.384s
>> 40.512s  40.228s  40.188s  37.740s  37.588s  37.368s
>> 40.584s  40.136s  40.216s  37.880s  37.492s  37.348s
>> 40.572s  40.292s  40.232s  37.756s  37.556s  37.676s
>> 40.764s  40.312s  40.232s  37.876s  37.640s  37.468s
>> 40.688s  40.284s  40.368s  37.972s  37.608s  37.460s
>>
>> So while alt2 is faster in most runs, the margin is not quite as big as
>> in the previous benchmark. (The benchmarks were done on a practically
>> unloaded system so it shouldn't vary too much from run to run, but in
>> practice, the first few runs seem to be slightly faster than the later
>> ones.)
>>
>> I.e. around 400 ms gain out of 40 s for alt1, and then another
>> -50 - +150 ms speedup on top of that for alt2.
>>
>> What do you think?
>
> At least it looks like the difference between alt1 and alt2 are quite
> similar on 32- and 64-bit. So we should use the same variant on both
> archs. I favor alternate 2.

Ok then - I'll try to polish up and push alternative 2 based on the
feedback I got.

// Martin
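The "around 400 ms" estimate above can be cross-checked by averaging each column of the table. The following is a small sketch for doing that; the array holds the 'user' times from the mail, while user_time, mean_time and mean_gain are names invented for this example.

```c
#include <stddef.h>

#define RUNS     7
#define VARIANTS 6

/* 'user' times in seconds from the table in the mail above.
 * Columns: 32 orig, 32 alt1, 32 alt2, 64 orig, 64 alt1, 64 alt2. */
static const double user_time[RUNS][VARIANTS] = {
    { 40.436, 40.148, 40.008, 37.428, 37.356, 37.192 },
    { 40.596, 40.140, 40.216, 37.572, 37.524, 37.384 },
    { 40.512, 40.228, 40.188, 37.740, 37.588, 37.368 },
    { 40.584, 40.136, 40.216, 37.880, 37.492, 37.348 },
    { 40.572, 40.292, 40.232, 37.756, 37.556, 37.676 },
    { 40.764, 40.312, 40.232, 37.876, 37.640, 37.468 },
    { 40.688, 40.284, 40.368, 37.972, 37.608, 37.460 },
};

/* Arithmetic mean of one column (one build variant) over all runs. */
static double mean_time(size_t variant)
{
    double sum = 0.0;
    size_t run;

    for (run = 0; run < RUNS; run++)
        sum += user_time[run][variant];
    return sum / RUNS;
}

/* Average time saved by `variant` relative to `baseline`, in seconds.
 * Positive means the variant is faster on average. */
static double mean_gain(size_t baseline, size_t variant)
{
    return mean_time(baseline) - mean_time(variant);
}
```

For example, mean_gain(0, 1) compares "32 orig" against "32 alt1", and mean_gain(1, 2) shows how much alt2 adds on top of alt1. A plain mean ignores the run-to-run drift mentioned in the mail, so it is only a rough summary.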
Re: [libav-devel] [PATCH 5/5] aarch64: vp9itxfm: Do a simpler half/quarter idct16/idct32 when possible (alternative 1)
On Thu, 9 Feb 2017, Janne Grunau wrote:

> On 2017-02-09 09:50:48 +0200, Martin Storsjö wrote:
>> On Thu, 9 Feb 2017, Janne Grunau wrote:
>>
>>> On 2017-02-05 14:05:49 +0200, Martin Storsjö wrote:
>>>> On Sun, 5 Feb 2017, Janne Grunau wrote:
>>>>
>>>>>>  // out1 = in1 + in2
>>>>>>  // out2 = in1 - in2
>>>>>>  .macro butterfly_8h out1, out2, in1, in2
>>>>>> @@ -463,7 +510,7 @@ function idct16x16_dc_add_neon
>>>>>>  ret
>>>>>>  endfunc
>>>>>>
>>>>>> -function idct16
>>>>>> +.macro idct16_full
>>>>>>  dmbutterfly0 v16, v24, v16, v24, v2, v3, v4, v5, v6, v7 // v16 = t0a, v24 = t1a
>>>>>>  dmbutterfly v20, v28, v0.h[1], v0.h[2], v2, v3, v4, v5 // v20 = t2a, v28 = t3a
>>>>>>  dmbutterfly v18, v30, v0.h[3], v0.h[4], v2, v3, v4, v5 // v18 = t4a, v30 = t7a
>>>>>> @@ -485,7 +532,10 @@ function idct16
>>>>>>  dmbutterfly0 v22, v26, v22, v26, v2, v3, v18, v19, v30, v31 // v22 = t6a, v26 = t5a
>>>>>>  dmbutterfly v23, v25, v0.h[1], v0.h[2], v18, v19, v30, v31 // v23 = t9a, v25 = t14a
>>>>>>  dmbutterfly v27, v21, v0.h[1], v0.h[2], v18, v19, v30, v31, neg=1 // v27 = t13a, v21 = t10a
>>>>>> +idct16_end
>>>>>
>>>>> I think it would be clearer if idct16_end is used directly from the macro.
>>>>> it would probably also make sense to move idct16_end and avoid the
>>>>> idct16_full macro. The patch might be smaller and it is immediately
>>>>> obvious that there is no code change but the resulting code is more
>>>>> complicated than it needs to be. same applies to arm if we go with
>>>>> alternative 1.
>>>>
>>>> Ok, so you mean like this?
>>>>
>>>> function idct16
>>>> dmbutterfly...
>>>>
>>>> idct16_end
>>>> endfunc
>>>
>>> that would be one option, the other would be to move the idct_end
>>> instructions as a macro out of the existing idct16 function and use it
>>> as macro. That would make the full idct structurally identical to the half
>>> and quarter version and avoid a macro only used once.
>>
>> I'm not really following what you're suggesting here - can you outline it
>> with a code sample like mine above?
>
> sorry, it seems I wasn't fully awake. I misread your code snippet. To
> avoid any confusion here is what I meant outlined as pseudo patch:
>
> @@
> +.macro idct16_end
> +[code from the existing idct16 function]
> +.endm
> +
>  function idct16
> @@
>  ...
> +idct16_end
> -[code moved to the idct16_end macro]
>  endfunc

Right - yes, that's exactly what I meant, and what I did locally based on
your earlier comment.

// Martin
Re: [libav-devel] [PATCH 5/5] aarch64: vp9itxfm: Do a simpler half/quarter idct16/idct32 when possible (alternative 1)
On 2017-02-09 09:50:48 +0200, Martin Storsjö wrote:
> On Thu, 9 Feb 2017, Janne Grunau wrote:
>
> >On 2017-02-05 14:05:49 +0200, Martin Storsjö wrote:
> >>On Sun, 5 Feb 2017, Janne Grunau wrote:
> >>
> >>>> // out1 = in1 + in2
> >>>> // out2 = in1 - in2
> >>>> .macro butterfly_8h out1, out2, in1, in2
> >>>>@@ -463,7 +510,7 @@ function idct16x16_dc_add_neon
> >>>> ret
> >>>> endfunc
> >>>>
> >>>>-function idct16
> >>>>+.macro idct16_full
> >>>> dmbutterfly0 v16, v24, v16, v24, v2, v3, v4, v5, v6, v7 // v16 = t0a, v24 = t1a
> >>>> dmbutterfly v20, v28, v0.h[1], v0.h[2], v2, v3, v4, v5 // v20 = t2a, v28 = t3a
> >>>> dmbutterfly v18, v30, v0.h[3], v0.h[4], v2, v3, v4, v5 // v18 = t4a, v30 = t7a
> >>>>@@ -485,7 +532,10 @@ function idct16
> >>>> dmbutterfly0 v22, v26, v22, v26, v2, v3, v18, v19, v30, v31 // v22 = t6a, v26 = t5a
> >>>> dmbutterfly v23, v25, v0.h[1], v0.h[2], v18, v19, v30, v31 // v23 = t9a, v25 = t14a
> >>>> dmbutterfly v27, v21, v0.h[1], v0.h[2], v18, v19, v30, v31, neg=1 // v27 = t13a, v21 = t10a
> >>>>+idct16_end
> >>>
> >>>I think it would be clearer if idct16_end is used directly from the macro.
> >>>it would probably also make sense to move idct16_end and avoid the
> >>>idct16_full macro. The patch might be smaller and it is immediately
> >>>obvious that there is no code change but the resulting code is more
> >>>complicated than it needs to be. same applies to arm if we go with
> >>>alternative 1.
> >>
> >>Ok, so you mean like this?
> >>
> >>function idct16
> >>dmbutterfly...
> >>
> >>idct16_end
> >>endfunc
> >
> >that would be one option, the other would be to move the idct_end
> >instructions as a macro out of the existing idct16 function and use it
> >as macro. That would make the full idct structurally identical to the half
> >and quarter version and avoid a macro only used once.
>
> I'm not really following what you're suggesting here - can you outline it
> with a code sample like mine above?

sorry, it seems I wasn't fully awake. I misread your code snippet. To
avoid any confusion here is what I meant outlined as pseudo patch:

@@
+.macro idct16_end
+[code from the existing idct16 function]
+.endm
+
 function idct16
@@
 ...
+idct16_end
-[code moved to the idct16_end macro]
 endfunc

Janne