Re: [libav-devel] [PATCH] utvideodec: Add a missing include

2017-02-09 Thread Martin Storsjö

On Fri, 10 Feb 2017, Martin Storsjö wrote:


> This was missing from 77c23704c76; adding it fixes the build.
> ---
> libavcodec/utvideodec.c | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/libavcodec/utvideodec.c b/libavcodec/utvideodec.c
> index 381b4f7..808e3be 100644
> --- a/libavcodec/utvideodec.c
> +++ b/libavcodec/utvideodec.c
> @@ -33,6 +33,7 @@
>  #include "bitstream.h"
>  #include "bswapdsp.h"
>  #include "bytestream.h"
> +#include "internal.h"
>  #include "thread.h"
>  #include "utvideo.h"
>
> --
> 2.10.1 (Apple Git-78)


Approved by wm4 on irc, pushed.

// Martin
___
libav-devel mailing list
libav-devel@libav.org
https://lists.libav.org/mailman/listinfo/libav-devel

[libav-devel] [PATCH] utvideodec: Add a missing include

2017-02-09 Thread Martin Storsjö
This was missing from 77c23704c76; adding it fixes the build.
---
 libavcodec/utvideodec.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/libavcodec/utvideodec.c b/libavcodec/utvideodec.c
index 381b4f7..808e3be 100644
--- a/libavcodec/utvideodec.c
+++ b/libavcodec/utvideodec.c
@@ -33,6 +33,7 @@
 #include "bitstream.h"
 #include "bswapdsp.h"
 #include "bytestream.h"
+#include "internal.h"
 #include "thread.h"
 #include "utvideo.h"
 
-- 
2.10.1 (Apple Git-78)


Re: [libav-devel] [PATCH] travis: Ignore the filter-fade test

2017-02-09 Thread Vittorio Giovara
On Thu, Feb 9, 2017 at 8:30 PM, Luca Barbato wrote:
> On 26/01/2017 12:42, Luca Barbato wrote:
>> It glitches with the stale travis linux target.
>> ---
>>
>>  .travis.yml | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/.travis.yml b/.travis.yml
>> index 8e9629a..f7dab48 100644
>> --- a/.travis.yml
>> +++ b/.travis.yml
>> @@ -20,7 +20,7 @@ install:
>>    - if [ "$TRAVIS_OS_NAME" == "osx" ]; then brew install yasm; fi
>>  script:
>>    - mkdir -p libav-samples
>> -  - ./configure --samples=libav-samples --cc=$CC
>> +  - ./configure --samples=libav-samples --cc=$CC --ignore-tests=filter-fade
>>    - make -j 8
>>    - make fate-rsync
>>    - make check -j 8
>> --
>> 2.9.2
>
> Ping. I'd merge it tomorrow; I'm not keen on figuring out what makes that
> combination of compiler and VM upset with that specific filter.

I'm ok with it but please change the commit log to something more descriptive.
-- 
Vittorio

Re: [libav-devel] [PATCH] travis: Ignore the filter-fade test

2017-02-09 Thread Luca Barbato
On 26/01/2017 12:42, Luca Barbato wrote:
> It glitches with the stale travis linux target.
> ---
> 
>  .travis.yml | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/.travis.yml b/.travis.yml
> index 8e9629a..f7dab48 100644
> --- a/.travis.yml
> +++ b/.travis.yml
> @@ -20,7 +20,7 @@ install:
>    - if [ "$TRAVIS_OS_NAME" == "osx" ]; then brew install yasm; fi
>  script:
>    - mkdir -p libav-samples
> -  - ./configure --samples=libav-samples --cc=$CC
> +  - ./configure --samples=libav-samples --cc=$CC --ignore-tests=filter-fade
>    - make -j 8
>    - make fate-rsync
>    - make check -j 8
> --
> 2.9.2

Ping. I'd merge it tomorrow; I'm not keen on figuring out what makes that
combination of compiler and VM upset with that specific filter.

lu


[libav-devel] [PATCH] hlsenc: Correctly write down all 16 bytes in hex

2017-02-09 Thread Luca Barbato
---
 libavformat/hlsenc.c | 11 ++-
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/libavformat/hlsenc.c b/libavformat/hlsenc.c
index 05c9adb..7aef02b 100644
--- a/libavformat/hlsenc.c
+++ b/libavformat/hlsenc.c
@@ -102,11 +102,12 @@ static void free_encryption(AVFormatContext *s)
 av_freep(&hls->key_basename);
 }
 
-static int dict_set_bin(AVDictionary **dict, const char *key, uint8_t *buf)
+static int dict_set_bin(AVDictionary **dict, const char *key,
+uint8_t *buf, size_t len)
 {
 char hex[33];
 
-ff_data_to_hex(hex, buf, sizeof(buf), 0);
+ff_data_to_hex(hex, buf, len, 0);
 hex[32] = '\0';
 
 return av_dict_set(dict, key, hex, 0);
@@ -136,7 +137,7 @@ static int setup_encryption(AVFormatContext *s)
 return AVERROR(EINVAL);
 }
 
-if ((ret = dict_set_bin(&hls->enc_opts, "key", hls->key)) < 0)
+if ((ret = dict_set_bin(&hls->enc_opts, "key", hls->key, hls->key_len)) < 0)
 return ret;
 k = hls->key;
 } else {
@@ -145,7 +146,7 @@ static int setup_encryption(AVFormatContext *s)
 return ret;
 }
 
-if ((ret = dict_set_bin(&hls->enc_opts, "key", buf)) < 0)
+if ((ret = dict_set_bin(&hls->enc_opts, "key", buf, sizeof(buf))) < 0)
 return ret;
 k = buf;
 }
@@ -158,7 +159,7 @@ static int setup_encryption(AVFormatContext *s)
 return AVERROR(EINVAL);
 }
 
-if ((ret = dict_set_bin(&hls->enc_opts, "iv", hls->iv)) < 0)
+if ((ret = dict_set_bin(&hls->enc_opts, "iv", hls->iv, hls->iv_len)) < 0)
 return ret;
 }
 
-- 
2.9.2


Re: [libav-devel] [PATCH] hlsenc: Correctly write down all 16 bytes in hex

2017-02-09 Thread Luca Barbato
On 09/02/2017 20:21, Anton Khirnov wrote:
> Looks very unsafe. Just pass the buffer size as a function parameter.

Ok, the size must be 16 though.

Re: [libav-devel] [PATCH] avcodec/nvenc: make gpu indices independent of supported capabilities

2017-02-09 Thread Ganapathy Raman Kasi
Thanks Luca. Please help submit the patch.

Ganapathy

-----Original Message-----
From: libav-devel [mailto:libav-devel-boun...@libav.org] On Behalf Of Luca Barbato
Sent: Wednesday, February 8, 2017 4:18 PM
To: libav-devel@libav.org
Subject: Re: [libav-devel] [PATCH] avcodec/nvenc: make gpu indices independent of supported capabilities

On 08/02/2017 23:52, Ganapathy Raman Kasi wrote:
> Hi,
> 
> This patch avoids creating multiple unnecessary CUDA contexts when the
> index of the GPU device to use is greater than 0. Each CUDA context
> creation takes about 100 ms, so this patch helps reduce the
> initialization time when we are using one of the secondary GPUs in
> the system. The patch is being ported to libav. Please let me know if
> there is a better way to port patches. Thanks.

Looks good, thank you! Sending it like this is OK; if you could use git
send-email it might be faster, but I guess that depends on your mailing system.

lu




Re: [libav-devel] [PATCH] hlsenc: Correctly write down all 16 bytes in hex

2017-02-09 Thread Anton Khirnov
Quoting Luca Barbato (2017-02-08 13:42:30)
> ---
> 
>  libavformat/hlsenc.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/libavformat/hlsenc.c b/libavformat/hlsenc.c
> index 05c9adb..3496bdd 100644
> --- a/libavformat/hlsenc.c
> +++ b/libavformat/hlsenc.c
> @@ -106,7 +106,7 @@ static int dict_set_bin(AVDictionary **dict, const char *key, uint8_t *buf)
>  {
>  char hex[33];
> 
> -ff_data_to_hex(hex, buf, sizeof(buf), 0);
> +ff_data_to_hex(hex, buf, 16, 0);
>  hex[32] = '\0';
> 
>  return av_dict_set(dict, key, hex, 0);
> --
> 2.9.2

Looks very unsafe. Just pass the buffer size as a function parameter.

-- 
Anton Khirnov
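
For readers following the review comment: inside dict_set_bin() the argument buf is a pointer, so sizeof(buf) evaluates to the pointer size (commonly 4 or 8 bytes) and not to the 16 bytes of the key. The sketch below shows the suggested remedy of passing the length explicitly. It is an illustration under stated assumptions only: data_to_hex(), key_to_hex(), pointer_param_size() and KEY_LEN are hypothetical names invented for this note (data_to_hex() merely stands in for ff_data_to_hex()), and rejecting any length other than 16 is an assumption derived from the fixed hex[33] buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define KEY_LEN 16 /* assumed key/IV size, matching the hex[33] buffer */

/* Hypothetical stand-in for ff_data_to_hex(): writes 2 * len lowercase
 * hex digits to out and does not add a terminator. */
static void data_to_hex(char *out, const uint8_t *src, size_t len)
{
    static const char digits[] = "0123456789abcdef";

    for (size_t i = 0; i < len; i++) {
        out[2 * i]     = digits[src[i] >> 4];
        out[2 * i + 1] = digits[src[i] & 0x0f];
    }
}

/* Shows the pitfall: for a pointer parameter, sizeof gives the size of
 * the pointer itself, not of the buffer it points to. */
static size_t pointer_param_size(uint8_t *buf)
{
    return sizeof(buf);
}

/* Length is passed by the caller; hex must have room for
 * 2 * KEY_LEN + 1 characters. Returns 0 on success, -1 otherwise. */
static int key_to_hex(char *hex, const uint8_t *buf, size_t len)
{
    assert(hex && buf);

    if (len != KEY_LEN)
        return -1;

    data_to_hex(hex, buf, len);
    hex[2 * KEY_LEN] = '\0';
    return 0;
}
```

With a 16-byte key holding 0x00 to 0x0f, key_to_hex() fills hex with 000102030405060708090a0b0c0d0e0f, while a call with len 8 is rejected rather than leaving the tail of hex uninitialised.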

Re: [libav-devel] [PATCH 1/6] arm: vp9itxfm: Share instructions for loading idct coeffs in the 8x8 function

2017-02-09 Thread Janne Grunau
On 2017-02-09 14:29:56 +0200, Martin Storsjö wrote:
> ---
>  libavcodec/arm/vp9itxfm_neon.S | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/libavcodec/arm/vp9itxfm_neon.S b/libavcodec/arm/vp9itxfm_neon.S
> index 167d517..3d0b0fa 100644
> --- a/libavcodec/arm/vp9itxfm_neon.S
> +++ b/libavcodec/arm/vp9itxfm_neon.S
> @@ -412,13 +412,12 @@ function ff_vp9_\txfm1\()_\txfm2\()_8x8_add_neon, export=1
>  .ifc \txfm1\()_\txfm2,idct_idct
>  movrel  r12, idct_coeffs
>  vpush   {q4-q5}
> -vld1.16 {q0}, [r12,:128]
>  .else
>  movrel  r12, iadst8_coeffs
>  vld1.16 {q1}, [r12,:128]!
>  vpush   {q4-q7}
> -vld1.16 {q0}, [r12,:128]
>  .endif
> +vld1.16 {q0}, [r12,:128]
>  
>  vmov.i16 q2, #0
>  vmov.i16 q3, #0

the whole set is ok

Janne

Re: [libav-devel] [PATCH] dv: Convert to the new bitreader

2017-02-09 Thread Luca Barbato
On 09/02/2017 17:42, Diego Biurrun wrote:
> I guess the size variables should have size_t type ;-p

As you like.

Re: [libav-devel] [PATCH] dv: Convert to the new bitreader

2017-02-09 Thread Diego Biurrun
On Thu, Feb 09, 2017 at 05:41:21PM +0100, Diego Biurrun wrote:
> --- a/libavcodec/bitstream.h
> +++ b/libavcodec/bitstream.h
> @@ -384,4 +384,32 @@ static inline int bitstream_apply_sign(BitstreamContext *bc, int val)
>  
> +static inline void bitstream_unget(BitstreamContext *bc, uint64_t value, int size)
> +{
> +int cache_size = sizeof(bc->bits) * 8;

I guess the size variables should have size_t type ;-p

Diego
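
The effect of the quoted bitstream_unget() can be tried out on a toy model of the cache. The sketch assumes that valid bits sit at the most significant end of a 64-bit word, which is what the shift by cache_size - size in the patch suggests. BitCache, cache_read() and cache_unget() are hypothetical names for this note, and the overflow case that the patch handles via bitstream_unwind() is only asserted against here.

```c
#include <assert.h>
#include <stdint.h>

/* Toy model: valid bits live at the top (MSB side) of a 64-bit cache. */
typedef struct BitCache {
    uint64_t bits;
    int      bits_left;
} BitCache;

/* Consume n bits (1..32) from the top of the cache. */
static uint32_t cache_read(BitCache *c, int n)
{
    uint32_t v;

    assert(n >= 1 && n <= 32 && n <= c->bits_left);
    v = (uint32_t)(c->bits >> (64 - n));
    c->bits     <<= n;
    c->bits_left -= n;
    return v;
}

/* Push size bits (1..32) back in front of the remaining ones, using the
 * same shift-and-or as the patch. Unlike the patch, this model does not
 * unwind; it just requires that the bits fit. */
static void cache_unget(BitCache *c, uint64_t value, int size)
{
    /* sizeof yields a size_t; the cast makes the narrowing explicit. */
    const int cache_size = (int)(sizeof(c->bits) * 8);

    assert(size >= 1 && size <= 32);
    assert(c->bits_left + size <= cache_size);

    c->bits       = (c->bits >> size) | (value << (cache_size - size));
    c->bits_left += size;
}
```

Reading 4 bits from a cache holding 0xABCD in its top 16 bits and then ungetting them restores both the cache word and the bit count, which is the round trip dv_decode_ac() relies on for partial codewords.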

[libav-devel] [PATCH] dv: Convert to the new bitreader

2017-02-09 Thread Diego Biurrun
From: Luca Barbato 

---

Moved the bitstream_unwind() and bitstream_unget() functions to bitstream.h
as requested by Anton.

 libavcodec/bitstream.h | 28 +++
 libavcodec/dvdec.c | 94 ++
 2 files changed, 69 insertions(+), 53 deletions(-)

diff --git a/libavcodec/bitstream.h b/libavcodec/bitstream.h
index 996e32e..f75a35c 100644
--- a/libavcodec/bitstream.h
+++ b/libavcodec/bitstream.h
@@ -384,4 +384,32 @@ static inline int bitstream_apply_sign(BitstreamContext *bc, int val)
 return (val ^ sign) - sign;
 }
 
+/* Unwind the cache so a refill_32 can fill it again. */
+static inline void bitstream_unwind(BitstreamContext *bc)
+{
+int unwind = 4;
+int unwind_bits = unwind * 8;
+
+if (bc->bits_left < unwind_bits)
+return;
+
+bc->bits  >>= unwind_bits;
+bc->bits  <<= unwind_bits;
+bc->bits_left  -= unwind_bits;
+bc->ptr -= unwind;
+}
+
+/* Unget up to 32 bits. */
+static inline void bitstream_unget(BitstreamContext *bc, uint64_t value, int size)
+{
+int cache_size = sizeof(bc->bits) * 8;
+
+if (bc->bits_left + size > cache_size)
+bitstream_unwind(bc);
+
+bc->bits = (bc->bits >> size) | (value << (cache_size - size));
+
+bc->bits_left += size;
+}
+
 #endif /* AVCODEC_BITSTREAM_H */
diff --git a/libavcodec/dvdec.c b/libavcodec/dvdec.c
index dc37a5e..a2f0171 100644
--- a/libavcodec/dvdec.c
+++ b/libavcodec/dvdec.c
@@ -40,9 +40,9 @@
 #include "libavutil/pixdesc.h"
 
 #include "avcodec.h"
+#include "bitstream.h"
 #include "dv.h"
 #include "dvdata.h"
-#include "get_bits.h"
 #include "idctdsp.h"
 #include "internal.h"
 #include "put_bits.h"
@@ -80,51 +80,34 @@ static av_cold int dvvideo_decode_init(AVCodecContext *avctx)
 }
 
 /* decode AC coefficients */
-static void dv_decode_ac(GetBitContext *gb, BlockInfo *mb, int16_t *block)
+static void dv_decode_ac(BitstreamContext *bc, BlockInfo *mb, int16_t *block)
 {
-int last_index = gb->size_in_bits;
 const uint8_t  *scan_table   = mb->scan_table;
 const uint32_t *factor_table = mb->factor_table;
 int pos  = mb->pos;
 int partial_bit_count= mb->partial_bit_count;
-int level, run, vlc_len, index;
-
-OPEN_READER_NOSIZE(re, gb);
-UPDATE_CACHE(re, gb);
+int level, run;
 
 /* if we must parse a partial VLC, we do it here */
 if (partial_bit_count > 0) {
-re_cache  = re_cache >> partial_bit_count |
-mb->partial_bit_buffer;
-re_index -= partial_bit_count;
+bitstream_unget(bc, mb->partial_bit_buffer, partial_bit_count);
 mb->partial_bit_count = 0;
 }
 
 /* get the AC coefficients until last_index is reached */
 for (;;) {
-ff_dlog(NULL, "%2d: bits=%04x index=%u\n", pos, SHOW_UBITS(re, gb, 16),
-re_index);
-/* our own optimized GET_RL_VLC */
-index   = NEG_USR32(re_cache, TEX_VLC_BITS);
-vlc_len = ff_dv_rl_vlc[index].len;
-if (vlc_len < 0) {
-index = NEG_USR32((unsigned) re_cache << TEX_VLC_BITS, -vlc_len) +
-ff_dv_rl_vlc[index].level;
-vlc_len = TEX_VLC_BITS - vlc_len;
-}
-level = ff_dv_rl_vlc[index].level;
-run   = ff_dv_rl_vlc[index].run;
-
-/* gotta check if we're still within gb boundaries */
-if (re_index + vlc_len > last_index) {
-/* should be < 16 bits otherwise a codeword could have been parsed */
-mb->partial_bit_count  = last_index - re_index;
-mb->partial_bit_buffer = re_cache & ~(-1u >> mb->partial_bit_count);
-re_index   = last_index;
+BitstreamContext tmp = *bc;
+
+ff_dlog(NULL, "%2d: bits=%04x index=%d\n",
+pos, bitstream_peek(bc, 16), bitstream_tell(bc));
+
+BITSTREAM_RL_VLC(level, run, bc, ff_dv_rl_vlc, TEX_VLC_BITS, 2);
+
+if (bitstream_bits_left(bc) < 0) {
+mb->partial_bit_count  = bitstream_bits_left(&tmp);
+mb->partial_bit_buffer = bitstream_peek(&tmp, mb->partial_bit_count);
 break;
 }
-re_index += vlc_len;
-
 ff_dlog(NULL, "run=%d level=%d\n", run, level);
 pos += run;
 if (pos >= 64)
@@ -133,22 +116,22 @@ static void dv_decode_ac(GetBitContext *gb, BlockInfo *mb, int16_t *block)
 level = (level * factor_table[pos] + (1 << (dv_iweight_bits - 1))) >>
 dv_iweight_bits;
 block[scan_table[pos]] = level;
-
-UPDATE_CACHE(re, gb);
 }
-CLOSE_READER(re, gb);
 mb->pos = pos;
 }
 
-static inline void bit_copy(PutBitContext *pb, GetBitContext *gb)
+static inline void bit_copy(PutBitContext *pb, BitstreamContext *bc)
 {
-int bits_left = get_bits_left(gb);
-while (bits_left >= MIN_CACHE_BITS) {
-put_bits(pb, MIN_CACHE_BITS, get_bits(gb, MIN_CACHE_BITS));
- 

Re: [libav-devel] [PATCH 6/6] aarch64: vp9itxfm: Fix incorrect vertical alignment

2017-02-09 Thread Diego Biurrun
On Thu, Feb 09, 2017 at 02:30:01PM +0200, Martin Storsjö wrote:
> ---
>  libavcodec/aarch64/vp9itxfm_neon.S | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)

OK

Diego

[libav-devel] [PATCH 4/4] aarch64: vp9itxfm: Reorder iadst16 coeffs

2017-02-09 Thread Martin Storsjö
This matches the order they are in the 16 bpp version.

There they are in this order, to make sure we access them in the
same order they are declared, easing loading only half of the
coefficients at a time.

This makes the 8 bpp version match the 16 bpp version better.
---
 libavcodec/aarch64/vp9itxfm_neon.S | 12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/libavcodec/aarch64/vp9itxfm_neon.S b/libavcodec/aarch64/vp9itxfm_neon.S
index f87f6bd..7b7dbd4 100644
--- a/libavcodec/aarch64/vp9itxfm_neon.S
+++ b/libavcodec/aarch64/vp9itxfm_neon.S
@@ -37,8 +37,8 @@ idct_coeffs:
 endconst
 
 const iadst16_coeffs, align=4
-.short  16364, 804, 15893, 3981, 14811, 7005, 13160, 9760
-.short  11003, 12140, 8423, 14053, 5520, 15426, 2404, 16207
+.short  16364, 804, 15893, 3981, 11003, 12140, 8423, 14053
+.short  14811, 7005, 13160, 9760, 5520, 15426, 2404, 16207
 endconst
 
 // out1 = ((in1 + in2) * d0[0] + (1 << 13)) >> 14
@@ -622,19 +622,19 @@ function iadst16
 ld1 {v0.8h,v1.8h}, [x11]
 
 dmbutterfly_l   v6,  v7,  v4,  v5,  v31, v16, v0.h[1], v0.h[0]   // v6,v7   = t1,   v4,v5   = t0
-dmbutterfly_l   v10, v11, v8,  v9,  v23, v24, v1.h[1], v1.h[0]   // v10,v11 = t9,   v8,v9   = t8
+dmbutterfly_l   v10, v11, v8,  v9,  v23, v24, v0.h[5], v0.h[4]   // v10,v11 = t9,   v8,v9   = t8
 dbutterfly_n    v31, v24, v6,  v7,  v10, v11, v12, v13, v10, v11 // v31 = t1a,  v24 = t9a
 dmbutterfly_l   v14, v15, v12, v13, v29, v18, v0.h[3], v0.h[2]   // v14,v15 = t3,   v12,v13 = t2
 dbutterfly_n    v16, v23, v4,  v5,  v8,  v9,  v6,  v7,  v8,  v9  // v16 = t0a,  v23 = t8a
 
-dmbutterfly_l   v6,  v7,  v4,  v5,  v21, v26, v1.h[3], v1.h[2]   // v6,v7   = t11,  v4,v5   = t10
+dmbutterfly_l   v6,  v7,  v4,  v5,  v21, v26, v0.h[7], v0.h[6]   // v6,v7   = t11,  v4,v5   = t10
 dbutterfly_n    v29, v26, v14, v15, v6,  v7,  v8,  v9,  v6,  v7  // v29 = t3a,  v26 = t11a
-dmbutterfly_l   v10, v11, v8,  v9,  v27, v20, v0.h[5], v0.h[4]   // v10,v11 = t5,   v8,v9   = t4
+dmbutterfly_l   v10, v11, v8,  v9,  v27, v20, v1.h[1], v1.h[0]   // v10,v11 = t5,   v8,v9   = t4
 dbutterfly_n    v18, v21, v12, v13, v4,  v5,  v6,  v7,  v4,  v5  // v18 = t2a,  v21 = t10a
 
 dmbutterfly_l   v14, v15, v12, v13, v19, v28, v1.h[5], v1.h[4]   // v14,v15 = t13,  v12,v13 = t12
 dbutterfly_n    v20, v28, v10, v11, v14, v15, v4,  v5,  v14, v15 // v20 = t5a,  v28 = t13a
-dmbutterfly_l   v6,  v7,  v4,  v5,  v25, v22, v0.h[7], v0.h[6]   // v6,v7   = t7,   v4,v5   = t6
+dmbutterfly_l   v6,  v7,  v4,  v5,  v25, v22, v1.h[3], v1.h[2]   // v6,v7   = t7,   v4,v5   = t6
 dbutterfly_n    v27, v19, v8,  v9,  v12, v13, v10, v11, v12, v13 // v27 = t4a,  v19 = t12a
 
 dmbutterfly_l   v10, v11, v8,  v9,  v17, v30, v1.h[7], v1.h[6]   // v10,v11 = t15,  v8,v9   = t14
-- 
2.7.4
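
A small host-side check, written for illustration only, shows what the reordering buys. It assumes that one 128-bit register holds eight 16-bit coefficients (v0 and v1 after the ld1 in the hunk above) and takes the access order from the dmbutterfly_l lines in this patch. find_index() and register_switches() are hypothetical helpers, not part of libav.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_COEFFS    16
#define LANES_PER_REG 8 /* eight 16-bit lanes per 128-bit register */

/* iadst16_coeffs before and after the patch. */
static const int16_t old_order[NUM_COEFFS] = {
    16364, 804, 15893, 3981, 14811, 7005, 13160, 9760,
    11003, 12140, 8423, 14053, 5520, 15426, 2404, 16207
};
static const int16_t new_order[NUM_COEFFS] = {
    16364, 804, 15893, 3981, 11003, 12140, 8423, 14053,
    14811, 7005, 13160, 9760, 5520, 15426, 2404, 16207
};

/* Even-lane coefficient of each butterfly, in the order iadst16 issues
 * them: t0, t8, t2, t10, t4, t12, t6, t14. */
static const int16_t access_order[8] = {
    16364, 11003, 15893, 8423, 14811, 5520, 13160, 2404
};

/* Position of value in table, or -1 if it is absent. */
static int find_index(const int16_t *table, int16_t value)
{
    for (int i = 0; i < NUM_COEFFS; i++)
        if (table[i] == value)
            return i;
    return -1;
}

/* How often consecutive butterflies read from a different register. */
static int register_switches(const int16_t *table)
{
    int switches = 0;
    int prev_reg = -1;

    for (int i = 0; i < 8; i++) {
        int idx = find_index(table, access_order[i]);
        int reg;

        assert(idx >= 0);
        reg = idx / LANES_PER_REG;
        if (prev_reg >= 0 && reg != prev_reg)
            switches++;
        prev_reg = reg;
    }
    return switches;
}
```

Under these assumptions register_switches(old_order) is 7, because the accesses alternate between the two registers, while register_switches(new_order) is 1: the first four butterflies need only the first half of the table, which is what makes loading half of the coefficients at a time straightforward.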


[libav-devel] [PATCH 2/4] aarch64: vp9itxfm: Reorder the idct coefficients for better pairing

2017-02-09 Thread Martin Storsjö
All elements are used pairwise, except for the first one.
Previously, the 16th element was unused. Move the unused element
to the second slot, to make the later element pairs not split
across registers.

This simplifies loading only parts of the coefficients,
reducing the difference to the 16 bpp version.
---
 libavcodec/aarch64/vp9itxfm_neon.S | 124 ++---
 1 file changed, 62 insertions(+), 62 deletions(-)

diff --git a/libavcodec/aarch64/vp9itxfm_neon.S b/libavcodec/aarch64/vp9itxfm_neon.S
index c954d1a..f87f6bd 100644
--- a/libavcodec/aarch64/vp9itxfm_neon.S
+++ b/libavcodec/aarch64/vp9itxfm_neon.S
@@ -22,7 +22,7 @@
 #include "neon.S"
 
 const itxfm4_coeffs, align=4
-.short  11585, 6270, 15137, 0
+.short  11585, 0, 6270, 15137
 iadst4_coeffs:
 .short  5283, 15212, 9929, 13377
 endconst
@@ -30,8 +30,8 @@ endconst
 const iadst8_coeffs, align=4
 .short  16305, 1606, 14449, 7723, 10394, 12665, 4756, 15679
 idct_coeffs:
-.short  11585, 6270, 15137, 3196, 16069, 13623, 9102, 1606
-.short  16305, 12665, 10394, 7723, 14449, 15679, 4756, 0
+.short  11585, 0, 6270, 15137, 3196, 16069, 13623, 9102
+.short  1606, 16305, 12665, 10394, 7723, 14449, 15679, 4756
 .short  804, 16364, 12140, 11003, 7005, 14811, 15426, 5520
 .short  3981, 15893, 14053, 8423, 9760, 13160, 16207, 2404
 endconst
@@ -192,14 +192,14 @@ endconst
 .endm
 
 .macro idct4 c0, c1, c2, c3
-smull   v22.4s,\c1\().4h, v0.h[2]
-smull   v20.4s,\c1\().4h, v0.h[1]
+smull   v22.4s,\c1\().4h, v0.h[3]
+smull   v20.4s,\c1\().4h, v0.h[2]
 add v16.4h,\c0\().4h, \c2\().4h
 sub v17.4h,\c0\().4h, \c2\().4h
-smlal   v22.4s,\c3\().4h, v0.h[1]
+smlal   v22.4s,\c3\().4h, v0.h[2]
 smull   v18.4s,v16.4h,v0.h[0]
 smull   v19.4s,v17.4h,v0.h[0]
-smlsl   v20.4s,\c3\().4h, v0.h[2]
+smlsl   v20.4s,\c3\().4h, v0.h[3]
 rshrn   v22.4h,v22.4s,#14
 rshrn   v18.4h,v18.4s,#14
 rshrn   v19.4h,v19.4s,#14
@@ -326,9 +326,9 @@ itxfm_func4x4 iwht,  iwht
 
 .macro idct8
 dmbutterfly0    v16, v20, v16, v20, v2, v3, v4, v5, v6, v7 // v16 = t0a, v20 = t1a
-dmbutterfly v18, v22, v0.h[1], v0.h[2], v2, v3, v4, v5 // v18 = t2a, v22 = t3a
-dmbutterfly v17, v23, v0.h[3], v0.h[4], v2, v3, v4, v5 // v17 = t4a, v23 = t7a
-dmbutterfly v21, v19, v0.h[5], v0.h[6], v2, v3, v4, v5 // v21 = t5a, v19 = t6a
+dmbutterfly v18, v22, v0.h[2], v0.h[3], v2, v3, v4, v5 // v18 = t2a, v22 = t3a
+dmbutterfly v17, v23, v0.h[4], v0.h[5], v2, v3, v4, v5 // v17 = t4a, v23 = t7a
+dmbutterfly v21, v19, v0.h[6], v0.h[7], v2, v3, v4, v5 // v21 = t5a, v19 = t6a
 
 butterfly_8h    v24, v25, v16, v22 // v24 = t0, v25 = t3
 butterfly_8h    v28, v29, v17, v21 // v28 = t4, v29 = t5a
@@ -361,8 +361,8 @@ itxfm_func4x4 iwht,  iwht
 dmbutterfly0    v19, v20, v6, v7, v24, v26, v27, v28, v29, v30   // v19 = -out[3], v20 = out[4]
 neg v19.8h,   v19.8h  // v19 = out[3]
 
-dmbutterfly_l   v26, v27, v28, v29, v5,  v3,  v0.h[1], v0.h[2]   // v26,v27 = t5a, v28,v29 = t4a
-dmbutterfly_l   v2,  v3,  v4,  v5,  v31, v25, v0.h[2], v0.h[1]   // v2,v3   = t6a, v4,v5   = t7a
+dmbutterfly_l   v26, v27, v28, v29, v5,  v3,  v0.h[2], v0.h[3]   // v26,v27 = t5a, v28,v29 = t4a
+dmbutterfly_l   v2,  v3,  v4,  v5,  v31, v25, v0.h[3], v0.h[2]   // v2,v3   = t6a, v4,v5   = t7a
 
 dbutterfly_n    v17, v30, v28, v29, v2,  v3,  v6,  v7,  v24, v25 // v17 = -out[1], v30 = t6
 dbutterfly_n    v22, v31, v26, v27, v4,  v5,  v6,  v7,  v24, v25 // v22 = out[6],  v31 = t7
@@ -537,13 +537,13 @@ endfunc
 
 function idct16
 dmbutterfly0    v16, v24, v16, v24, v2, v3, v4, v5, v6, v7 // v16 = t0a,  v24 = t1a
-dmbutterfly v20, v28, v0.h[1], v0.h[2], v2, v3, v4, v5 // v20 = t2a,  v28 = t3a
-dmbutterfly v18, v30, v0.h[3], v0.h[4], v2, v3, v4, v5 // v18 = t4a,  v30 = t7a
-dmbutterfly v26, v22, v0.h[5], v0.h[6], v2, v3, v4, v5 // v26 = t5a,  v22 = t6a
-dmbutterfly v17, v31, v0.h[7], v1.h[0], v2, v3, v4, v5 // v17 = t8a,  v31 = t15a
-dmbutterfly v25, v23, v1.h[1], v1.h[2], v2, v3, v4, v5 // v25 = t9a,  v23 = t14a
-dmbutterfly v21, v27, v1.h[3], v1.h[4], v2, v3, v4, v5 // v21 = t10a, v27 = t13a
-dmbutterfly v29, v19, v1.h[5], v1.h[6], v2, v3, v4, v5 // v29 = t11a, v19 = t12a
+dmbutterfly v20, v28, v0.h[2], v0.h[3], v2, v3, v4, v5 // v20 = t2a,  v28 = t3a
+dmbutterfly v18, v30, v0.h[4], v0.h[5], v2, v3, v4, v5 // v18 = t4a,  v30 = t7a
+

[libav-devel] [PATCH 3/4] arm: vp9itxfm: Reorder iadst16 coeffs

2017-02-09 Thread Martin Storsjö
This matches the order they are in the 16 bpp version.

There they are in this order, to make sure we access them in the
same order they are declared, easing loading only half of the
coefficients at a time.

This makes the 8 bpp version match the 16 bpp version better.
---
 libavcodec/arm/vp9itxfm_neon.S | 12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/libavcodec/arm/vp9itxfm_neon.S b/libavcodec/arm/vp9itxfm_neon.S
index f74d542..c8eeb76 100644
--- a/libavcodec/arm/vp9itxfm_neon.S
+++ b/libavcodec/arm/vp9itxfm_neon.S
@@ -37,8 +37,8 @@ idct_coeffs:
 endconst
 
 const iadst16_coeffs, align=4
-.short  16364, 804, 15893, 3981, 14811, 7005, 13160, 9760
-.short  11003, 12140, 8423, 14053, 5520, 15426, 2404, 16207
+.short  16364, 804, 15893, 3981, 11003, 12140, 8423, 14053
+.short  14811, 7005, 13160, 9760, 5520, 15426, 2404, 16207
 endconst
 
 @ Do four 4x4 transposes, using q registers for the subtransposes that don't
@@ -672,19 +672,19 @@ function iadst16
 vld1.16 {q0-q1}, [r12,:128]
 
 mbutterfly_l    q3,  q2,  d31, d16, d0[1], d0[0] @ q3  = t1,   q2  = t0
-mbutterfly_l    q5,  q4,  d23, d24, d2[1], d2[0] @ q5  = t9,   q4  = t8
+mbutterfly_l    q5,  q4,  d23, d24, d1[1], d1[0] @ q5  = t9,   q4  = t8
 butterfly_n d31, d24, q3,  q5,  q6,  q5  @ d31 = t1a,  d24 = t9a
 mbutterfly_l    q7,  q6,  d29, d18, d0[3], d0[2] @ q7  = t3,   q6  = t2
 butterfly_n d16, d23, q2,  q4,  q3,  q4  @ d16 = t0a,  d23 = t8a
 
-mbutterfly_l    q3,  q2,  d21, d26, d2[3], d2[2] @ q3  = t11,  q2  = t10
+mbutterfly_l    q3,  q2,  d21, d26, d1[3], d1[2] @ q3  = t11,  q2  = t10
 butterfly_n d29, d26, q7,  q3,  q4,  q3  @ d29 = t3a,  d26 = t11a
-mbutterfly_l    q5,  q4,  d27, d20, d1[1], d1[0] @ q5  = t5,   q4  = t4
+mbutterfly_l    q5,  q4,  d27, d20, d2[1], d2[0] @ q5  = t5,   q4  = t4
 butterfly_n d18, d21, q6,  q2,  q3,  q2  @ d18 = t2a,  d21 = t10a
 
 mbutterfly_l    q7,  q6,  d19, d28, d3[1], d3[0] @ q7  = t13,  q6  = t12
 butterfly_n d20, d28, q5,  q7,  q2,  q7  @ d20 = t5a,  d28 = t13a
-mbutterfly_l    q3,  q2,  d25, d22, d1[3], d1[2] @ q3  = t7,   q2  = t6
+mbutterfly_l    q3,  q2,  d25, d22, d2[3], d2[2] @ q3  = t7,   q2  = t6
 butterfly_n d27, d19, q4,  q6,  q5,  q6  @ d27 = t4a,  d19 = t12a
 
 mbutterfly_l    q5,  q4,  d17, d30, d3[3], d3[2] @ q5  = t15,  q4  = t14
-- 
2.7.4


[libav-devel] [PATCH 1/4] arm: vp9itxfm: Reorder the idct coefficients for better pairing

2017-02-09 Thread Martin Storsjö
All elements are used pairwise, except for the first one.
Previously, the 16th element was unused. Move the unused element
to the second slot, to make the later element pairs not split
across registers.

This simplifies loading only parts of the coefficients,
reducing the difference to the 16 bpp version.
---
The 16 bpp version is only in ffmpeg for now, since libav's vp9
decoder doesn't support the high bitdepth profiles. This change
in itself still makes sense to do though.
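
The pairing argument can be checked with a few lines of C. This is a counting model only, assuming 16-bit coefficients, four lanes per d register on arm and eight lanes per v register on aarch64. count_split_pairs() is a hypothetical helper written for this note, not libav code.

```c
#include <assert.h>

#define NUM_COEFFS 16 /* the first 16 entries of idct_coeffs */

/* Counts the coefficient pairs (i, i + 1) whose two halves end up in
 * different registers. first_pair is the index of the first paired
 * element: 1 in the old layout (unused slot at index 15), 2 in the new
 * layout (unused slot at index 1). */
static int count_split_pairs(int first_pair, int lanes_per_reg)
{
    int split = 0;

    assert(first_pair >= 0 && lanes_per_reg > 0);

    for (int i = first_pair; i + 1 < NUM_COEFFS; i += 2)
        if (i / lanes_per_reg != (i + 1) / lanes_per_reg)
            split++;

    return split;
}
```

count_split_pairs(1, 4) gives 3 for the old arm layout (the d0[3]/d1[0], d1[3]/d2[0] and d2[3]/d3[0] pairs visible in the diff), count_split_pairs(1, 8) gives 1 for the corresponding aarch64 layout, and both drop to 0 once the pairs start at index 2.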
---
 libavcodec/arm/vp9itxfm_neon.S | 124 -
 1 file changed, 62 insertions(+), 62 deletions(-)

diff --git a/libavcodec/arm/vp9itxfm_neon.S b/libavcodec/arm/vp9itxfm_neon.S
index 167d517..f74d542 100644
--- a/libavcodec/arm/vp9itxfm_neon.S
+++ b/libavcodec/arm/vp9itxfm_neon.S
@@ -22,7 +22,7 @@
 #include "neon.S"
 
 const itxfm4_coeffs, align=4
-.short  11585, 6270, 15137, 0
+.short  11585, 0, 6270, 15137
 iadst4_coeffs:
 .short  5283, 15212, 9929, 13377
 endconst
@@ -30,8 +30,8 @@ endconst
 const iadst8_coeffs, align=4
 .short  16305, 1606, 14449, 7723, 10394, 12665, 4756, 15679
 idct_coeffs:
-.short  11585, 6270, 15137, 3196, 16069, 13623, 9102, 1606
-.short  16305, 12665, 10394, 7723, 14449, 15679, 4756, 0
+.short  11585, 0, 6270, 15137, 3196, 16069, 13623, 9102
+.short  1606, 16305, 12665, 10394, 7723, 14449, 15679, 4756
 .short  804, 16364, 12140, 11003, 7005, 14811, 15426, 5520
 .short  3981, 15893, 14053, 8423, 9760, 13160, 16207, 2404
 endconst
@@ -224,14 +224,14 @@ endconst
 .endm
 
 .macro idct4 c0, c1, c2, c3
-vmull.s16   q13,  \c1,  d0[2]
-vmull.s16   q11,  \c1,  d0[1]
+vmull.s16   q13,  \c1,  d0[3]
+vmull.s16   q11,  \c1,  d0[2]
 vadd.i16 d16,  \c0,  \c2
 vsub.i16 d17,  \c0,  \c2
-vmlal.s16   q13,  \c3,  d0[1]
+vmlal.s16   q13,  \c3,  d0[2]
 vmull.s16   q9,   d16,  d0[0]
 vmull.s16   q10,  d17,  d0[0]
-vmlsl.s16   q11,  \c3,  d0[2]
+vmlsl.s16   q11,  \c3,  d0[3]
 vrshrn.s32  d26,  q13,  #14
 vrshrn.s32  d18,  q9,   #14
 vrshrn.s32  d20,  q10,  #14
@@ -350,9 +350,9 @@ itxfm_func4x4 iwht,  iwht
 
 .macro idct8
 dmbutterfly0    d16, d17, d24, d25, q8,  q12, q2, q4, d4, d5, d8, d9, q3, q2, q5, q4 @ q8 = t0a, q12 = t1a
-dmbutterfly d20, d21, d28, d29, d0[1], d0[2], q2,  q3,  q4,  q5 @ q10 = t2a, q14 = t3a
-dmbutterfly d18, d19, d30, d31, d0[3], d1[0], q2,  q3,  q4,  q5 @ q9  = t4a, q15 = t7a
-dmbutterfly d26, d27, d22, d23, d1[1], d1[2], q2,  q3,  q4,  q5 @ q13 = t5a, q11 = t6a
+dmbutterfly d20, d21, d28, d29, d0[2], d0[3], q2,  q3,  q4,  q5 @ q10 = t2a, q14 = t3a
+dmbutterfly d18, d19, d30, d31, d1[0], d1[1], q2,  q3,  q4,  q5 @ q9  = t4a, q15 = t7a
+dmbutterfly d26, d27, d22, d23, d1[2], d1[3], q2,  q3,  q4,  q5 @ q13 = t5a, q11 = t6a
 
 butterfly   q2,  q14, q8,  q14 @ q2 = t0, q14 = t3
 butterfly   q3,  q10, q12, q10 @ q3 = t1, q10 = t2
@@ -386,8 +386,8 @@ itxfm_func4x4 iwht,  iwht
 vneg.s16 q15, q15  @ q15 = out[7]
 butterfly   q8,  q9,  q11, q9 @ q8 = out[0], q9 = t2
 
-dmbutterfly_l   q10, q11, q5,  q7,  d4,  d5,  d6,  d7,  d0[1], d0[2] @ q10,q11 = t5a, q5,q7 = t4a
-dmbutterfly_l   q2,  q3,  q13, q14, d12, d13, d8,  d9,  d0[2], d0[1] @ q2,q3 = t6a, q13,q14 = t7a
+dmbutterfly_l   q10, q11, q5,  q7,  d4,  d5,  d6,  d7,  d0[2], d0[3] @ q10,q11 = t5a, q5,q7 = t4a
+dmbutterfly_l   q2,  q3,  q13, q14, d12, d13, d8,  d9,  d0[3], d0[2] @ q2,q3 = t6a, q13,q14 = t7a
 
 dbutterfly_n    d28, d29, d8,  d9,  q10, q11, q13, q14, q4,  q6,  q10, q11 @ q14 = out[6], q4 = t7
 
@@ -588,13 +588,13 @@ endfunc
 
 function idct16
 mbutterfly0 d16, d24, d16, d24, d4, d6,  q2,  q3 @ d16 = t0a,  d24 = t1a
-mbutterfly  d20, d28, d0[1], d0[2], q2,  q3  @ d20 = t2a,  d28 = t3a
-mbutterfly  d18, d30, d0[3], d1[0], q2,  q3  @ d18 = t4a,  d30 = t7a
-mbutterfly  d26, d22, d1[1], d1[2], q2,  q3  @ d26 = t5a,  d22 = t6a
-mbutterfly  d17, d31, d1[3], d2[0], q2,  q3  @ d17 = t8a,  d31 = t15a
-mbutterfly  d25, d23, d2[1], d2[2], q2,  q3  @ d25 = t9a,  d23 = t14a
-mbutterfly  d21, d27, d2[3], d3[0], q2,  q3  @ d21 = t10a, d27 = t13a
-mbutterfly  d29, d19, d3[1], d3[2], q2,  q3  @ d29 = t11a, d19 = t12a
+mbutterfly  d20, d28, d0[2], d0[3], q2,  q3  @ d20 = t2a,  d28 = t3a
+mbutterfly  d18, d30, d1[0], d1[1], q2,  q3  @ d18 = t4a,  d30 = t7a
+mbutterfly  d26, d22, d1[2], d1[3], q2,  q3  @ d26 = t5a,  d22 = t6a
+mbutterfly  d17, d31, d2[0], d2[1], q2,  q3  @ d17 = t8a,  d31 = t15a
+mbutterfly  d25, d23, d2[2], 

[libav-devel] [PATCH 3/6] aarch64: vp9itxfm: Use a single lane ld1 instead of ld1r where possible

2017-02-09 Thread Martin Storsjö
The ld1r is a leftover from the arm version, where this trick is
beneficial on some cores.

Use a single-lane load where we don't need the semantics of ld1r.
---
 libavcodec/aarch64/vp9itxfm_neon.S | 16 
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/libavcodec/aarch64/vp9itxfm_neon.S b/libavcodec/aarch64/vp9itxfm_neon.S
index a9c7626..e7b8836 100644
--- a/libavcodec/aarch64/vp9itxfm_neon.S
+++ b/libavcodec/aarch64/vp9itxfm_neon.S
@@ -255,7 +255,7 @@ function ff_vp9_\txfm1\()_\txfm2\()_4x4_add_neon, export=1
 cmp w3,  #1
 b.ne 1f
 // DC-only for idct/idct
-ld1r {v2.4h},  [x2]
+ld1 {v2.h}[0], [x2]
 smull   v2.4s,  v2.4h, v0.h[0]
 rshrn   v2.4h,  v2.4s, #14
 smull   v2.4s,  v2.4h, v0.h[0]
@@ -287,8 +287,8 @@ function ff_vp9_\txfm1\()_\txfm2\()_4x4_add_neon, export=1
 
 \txfm2\()4  v4,  v5,  v6,  v7
 2:
-ld1r {v0.2s},   [x0], x1
-ld1r {v1.2s},   [x0], x1
+ld1 {v0.s}[0],   [x0], x1
+ld1 {v1.s}[0],   [x0], x1
 .ifnc \txfm1,iwht
 srshr   v4.4h,  v4.4h,  #4
 srshr   v5.4h,  v5.4h,  #4
@@ -297,8 +297,8 @@ function ff_vp9_\txfm1\()_\txfm2\()_4x4_add_neon, export=1
 .endif
 uaddw   v4.8h,  v4.8h,  v0.8b
 uaddw   v5.8h,  v5.8h,  v1.8b
-ld1r {v2.2s},   [x0], x1
-ld1r {v3.2s},   [x0], x1
+ld1 {v2.s}[0],   [x0], x1
+ld1 {v3.s}[0],   [x0], x1
 sqxtun  v0.8b,  v4.8h
 sqxtun  v1.8b,  v5.8h
 sub x0,  x0,  x1, lsl #2
@@ -394,7 +394,7 @@ function ff_vp9_\txfm1\()_\txfm2\()_8x8_add_neon, export=1
 cmp w3,  #1
 b.ne 1f
 // DC-only for idct/idct
-ld1r {v2.4h},  [x2]
+ld1 {v2.h}[0],  [x2]
 smull   v2.4s,  v2.4h, v0.h[0]
 rshrn   v2.4h,  v2.4s, #14
 smull   v2.4s,  v2.4h, v0.h[0]
@@ -485,7 +485,7 @@ function idct16x16_dc_add_neon
 
 movi v1.4h, #0
 
-ld1r {v2.4h}, [x2]
+ld1 {v2.h}[0], [x2]
 smull   v2.4s,  v2.4h, v0.h[0]
 rshrn   v2.4h,  v2.4s, #14
 smull   v2.4s,  v2.4h, v0.h[0]
@@ -1044,7 +1044,7 @@ function idct32x32_dc_add_neon
 
 movi v1.4h, #0
 
-ld1r {v2.4h}, [x2]
+ld1 {v2.h}[0], [x2]
 smull   v2.4s,  v2.4h,  v0.h[0]
 rshrn   v2.4h,  v2.4s,  #14
 smull   v2.4s,  v2.4h,  v0.h[0]
-- 
2.7.4
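
The difference between the two load forms can be modelled in plain C. This is not NEON code and makes no claim about performance; it only mirrors the register contents, on the assumption that the DC-only paths touched by this patch consume lane 0 alone. load_replicate(), load_lane() and dc_from_vector() are hypothetical names for this note.

```c
#include <assert.h>
#include <stdint.h>

#define LANES 4 /* a .4h vector: four 16-bit lanes */

/* Models ld1r {v.4h}, [src]: the loaded element is replicated into
 * every lane. */
static void load_replicate(int16_t v[LANES], const int16_t *src)
{
    for (int i = 0; i < LANES; i++)
        v[i] = *src;
}

/* Models ld1 {v.h}[lane], [src]: only one lane is written, the other
 * lanes keep whatever they held before. */
static void load_lane(int16_t v[LANES], int lane, const int16_t *src)
{
    assert(lane >= 0 && lane < LANES);
    v[lane] = *src;
}

/* A consumer that looks at lane 0 only, as the DC-only path is assumed
 * to do. */
static int16_t dc_from_vector(const int16_t v[LANES])
{
    return v[0];
}
```

Both loads leave the same value in lane 0, so a consumer such as dc_from_vector() cannot tell them apart; the replicated copies in the other lanes are what the commit message calls the semantics of ld1r, and they are simply not needed here.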


[libav-devel] [PATCH 4/6] aarch64: vp9itxfm: Use the right lane sizes in 8x8 for improved readability

2017-02-09 Thread Martin Storsjö
---
 libavcodec/aarch64/vp9itxfm_neon.S | 16 
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/libavcodec/aarch64/vp9itxfm_neon.S b/libavcodec/aarch64/vp9itxfm_neon.S
index e7b8836..7582081 100644
--- a/libavcodec/aarch64/vp9itxfm_neon.S
+++ b/libavcodec/aarch64/vp9itxfm_neon.S
@@ -385,10 +385,10 @@ function ff_vp9_\txfm1\()_\txfm2\()_8x8_add_neon, export=1
 .endif
 ld1 {v0.8h}, [x4]
 
-movi v2.16b, #0
-movi v3.16b, #0
-movi v4.16b, #0
-movi v5.16b, #0
+movi v2.8h, #0
+movi v3.8h, #0
+movi v4.8h, #0
+movi v5.8h, #0
 
 .ifc \txfm1\()_\txfm2,idct_idct
 cmp w3,  #1
@@ -411,11 +411,11 @@ function ff_vp9_\txfm1\()_\txfm2\()_8x8_add_neon, export=1
 b   2f
 .endif
 1:
-ld1 {v16.16b,v17.16b,v18.16b,v19.16b},  [x2], #64
-ld1 {v20.16b,v21.16b,v22.16b,v23.16b},  [x2], #64
+ld1 {v16.8h,v17.8h,v18.8h,v19.8h},  [x2], #64
+ld1 {v20.8h,v21.8h,v22.8h,v23.8h},  [x2], #64
 sub x2,  x2,  #128
-st1 {v2.16b,v3.16b,v4.16b,v5.16b},  [x2], #64
-st1 {v2.16b,v3.16b,v4.16b,v5.16b},  [x2], #64
+st1 {v2.8h,v3.8h,v4.8h,v5.8h},  [x2], #64
+st1 {v2.8h,v3.8h,v4.8h,v5.8h},  [x2], #64
 
 \txfm1\()8
 
-- 
2.7.4


[libav-devel] [PATCH 5/6] aarch64: vp9itxfm: Update a comment to refer to a register with a different name

2017-02-09 Thread Martin Storsjö
---
 libavcodec/aarch64/vp9itxfm_neon.S | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/libavcodec/aarch64/vp9itxfm_neon.S b/libavcodec/aarch64/vp9itxfm_neon.S
index 7582081..8102720 100644
--- a/libavcodec/aarch64/vp9itxfm_neon.S
+++ b/libavcodec/aarch64/vp9itxfm_neon.S
@@ -41,8 +41,8 @@ const iadst16_coeffs, align=4
 .short  11003, 12140, 8423, 14053, 5520, 15426, 2404, 16207
 endconst
 
-// out1 = ((in1 + in2) * d0[0] + (1 << 13)) >> 14
-// out2 = ((in1 - in2) * d0[0] + (1 << 13)) >> 14
+// out1 = ((in1 + in2) * v0[0] + (1 << 13)) >> 14
+// out2 = ((in1 - in2) * v0[0] + (1 << 13)) >> 14
 // in/out are .8h registers; this can do with 4 temp registers, but is
 // more efficient if 6 temp registers are available.
 .macro dmbutterfly0 out1, out2, in1, in2, tmp1, tmp2, tmp3, tmp4, tmp5, tmp6, neg=0
-- 
2.7.4
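
For readers following the comment fix in this patch, the rounding butterfly it documents can be modelled per lane in scalar C. This is only a sketch of the formula in the comment, not the NEON instruction sequence: the function names are made up for illustration, the sum and difference are kept in 32 bits, and the coefficient 11585 (round(2^14 / sqrt(2))) mentioned below is an assumed example value.

```c
#include <assert.h>
#include <stdint.h>

/* Round to nearest and drop the 14 fractional bits:
 * (v + (1 << 13)) >> 14. Right-shifting a negative value is
 * implementation-defined in C; this assumes an arithmetic shift. */
static int16_t round_shift14(int32_t v)
{
    return (int16_t)((v + (1 << 13)) >> 14);
}

/* Hypothetical scalar model of one lane of dmbutterfly0:
 *   out1 = ((in1 + in2) * coef + (1 << 13)) >> 14
 *   out2 = ((in1 - in2) * coef + (1 << 13)) >> 14
 * coef is a Q14 fixed-point constant and is assumed non-negative
 * so that the 32-bit product cannot overflow. */
static void butterfly0_lane(int16_t *out1, int16_t *out2,
                            int16_t in1, int16_t in2, int16_t coef)
{
    assert(coef >= 0);
    int32_t sum  = (int32_t)in1 + in2;
    int32_t diff = (int32_t)in1 - in2;
    *out1 = round_shift14(sum * coef);
    *out2 = round_shift14(diff * coef);
}
```

With in1 = 100, in2 = 50 and coef = 11585, the two outputs are 106 and 35, i.e. 150/sqrt(2) and 50/sqrt(2) rounded to the nearest integer.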


[libav-devel] [PATCH 6/6] aarch64: vp9itxfm: Fix incorrect vertical alignment

2017-02-09 Thread Martin Storsjö
---
 libavcodec/aarch64/vp9itxfm_neon.S | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/libavcodec/aarch64/vp9itxfm_neon.S b/libavcodec/aarch64/vp9itxfm_neon.S
index 8102720..a199e9c 100644
--- a/libavcodec/aarch64/vp9itxfm_neon.S
+++ b/libavcodec/aarch64/vp9itxfm_neon.S
@@ -225,7 +225,7 @@ endconst
 add v21.4s,v17.4s,v19.4s
 rshrn   \c0\().4h, v20.4s,#14
 add v16.4s,v16.4s,v17.4s
-rshrn   \c1\().4h, v21.4s, #14
+rshrn   \c1\().4h, v21.4s,#14
 sub v16.4s,v16.4s,v19.4s
 rshrn   \c2\().4h, v18.4s,#14
 rshrn   \c3\().4h, v16.4s,#14
@@ -1313,8 +1313,8 @@ function idct32_1d_8x32_pass1\suffix\()_neon
 
 bl  idct32_odd\suffix
 
-transpose_8x8H v31, v30, v29, v28, v27, v26, v25, v24, v2, v3
-transpose_8x8H v23, v22, v21, v20, v19, v18, v17, v16, v2, v3
+transpose_8x8H  v31, v30, v29, v28, v27, v26, v25, v24, v2, v3
+transpose_8x8H  v23, v22, v21, v20, v19, v18, v17, v16, v2, v3
 
 // Store the registers a, b horizontally,
 // adding into the output first, and the mirrored,
-- 
2.7.4


[libav-devel] [PATCH 2/6] aarch64: vp9itxfm: Share instructions for loading idct coeffs in the 8x8 function

2017-02-09 Thread Martin Storsjö
---
 libavcodec/aarch64/vp9itxfm_neon.S | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/libavcodec/aarch64/vp9itxfm_neon.S b/libavcodec/aarch64/vp9itxfm_neon.S
index c954d1a..a9c7626 100644
--- a/libavcodec/aarch64/vp9itxfm_neon.S
+++ b/libavcodec/aarch64/vp9itxfm_neon.S
@@ -379,12 +379,11 @@ function ff_vp9_\txfm1\()_\txfm2\()_8x8_add_neon, export=1
 // idct, so those always need to be loaded.
 .ifc \txfm1\()_\txfm2,idct_idct
 movrel  x4,  idct_coeffs
-ld1 {v0.8h}, [x4]
 .else
 movrel  x4, iadst8_coeffs
 ld1 {v1.8h}, [x4], #16
-ld1 {v0.8h}, [x4]
 .endif
+ld1 {v0.8h}, [x4]
 
 movi v2.16b, #0
 movi v3.16b, #0
-- 
2.7.4


[libav-devel] [PATCH 1/6] arm: vp9itxfm: Share instructions for loading idct coeffs in the 8x8 function

2017-02-09 Thread Martin Storsjö
---
 libavcodec/arm/vp9itxfm_neon.S | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/libavcodec/arm/vp9itxfm_neon.S b/libavcodec/arm/vp9itxfm_neon.S
index 167d517..3d0b0fa 100644
--- a/libavcodec/arm/vp9itxfm_neon.S
+++ b/libavcodec/arm/vp9itxfm_neon.S
@@ -412,13 +412,12 @@ function ff_vp9_\txfm1\()_\txfm2\()_8x8_add_neon, export=1
 .ifc \txfm1\()_\txfm2,idct_idct
 movrel  r12, idct_coeffs
 vpush   {q4-q5}
-vld1.16 {q0}, [r12,:128]
 .else
 movrel  r12, iadst8_coeffs
 vld1.16 {q1}, [r12,:128]!
 vpush   {q4-q7}
-vld1.16 {q0}, [r12,:128]
 .endif
+vld1.16 {q0}, [r12,:128]
 
 vmov.i16 q2, #0
 vmov.i16 q3, #0
-- 
2.7.4


[libav-devel] [PATCH] arm: vp9itxfm: Avoid reloading the idct32 coefficients

2017-02-09 Thread Martin Storsjö
The idct32x32 function actually backed up and restored q4-q7 even
though it didn't clobber them; there are plenty of registers that
can be used to allow keeping all the idct coefficients in registers
without having to reload different subsets of them at different
stages in the transform.

Since the idct16 core transform avoids clobbering q4-q7 (but clobbers
q2-q3 instead, to avoid needing to back up and restore q4-q7 at all
in the idct16 function), and the lanewise vmul needs a register in
the q0-q3 range, we move the stored coefficients from q2-q3 into q4-q5
while doing idct16.

While keeping these coefficients in registers, we still can skip backing
up and restoring q7.

Before:                              Cortex A7       A8       A9      A53
vp9_inv_dct_dct_32x32_sub32_add_neon:  18553.8  17182.7  14303.3  12089.7
After:
vp9_inv_dct_dct_32x32_sub32_add_neon:  18470.3  16717.7  14173.6  11860.8
---
 libavcodec/arm/vp9itxfm_neon.S | 246 -
 1 file changed, 120 insertions(+), 126 deletions(-)

diff --git a/libavcodec/arm/vp9itxfm_neon.S b/libavcodec/arm/vp9itxfm_neon.S
index 167d517..df3f923 100644
--- a/libavcodec/arm/vp9itxfm_neon.S
+++ b/libavcodec/arm/vp9itxfm_neon.S
@@ -1168,58 +1168,51 @@ function idct32x32_dc_add_neon
 endfunc
 
 .macro idct32_end
-butterfly   d16, d5,  d4,  d5  @ d16 = t16a, d5  = t19a
+butterfly   d16, d9,  d8,  d9  @ d16 = t16a, d9  = t19a
 butterfly   d17, d20, d23, d20 @ d17 = t17,  d20 = t18
-butterfly   d18, d6,  d7,  d6  @ d18 = t23a, d6  = t20a
+butterfly   d18, d10, d11, d10 @ d18 = t23a, d10 = t20a
 butterfly   d19, d21, d22, d21 @ d19 = t22,  d21 = t21
-butterfly   d4,  d28, d28, d30 @ d4  = t24a, d28 = t27a
+butterfly   d8,  d28, d28, d30 @ d8  = t24a, d28 = t27a
 butterfly   d23, d26, d25, d26 @ d23 = t25,  d26 = t26
-butterfly   d7,  d29, d29, d31 @ d7  = t31a, d29 = t28a
+butterfly   d11, d29, d29, d31 @ d11 = t31a, d29 = t28a
 butterfly   d22, d27, d24, d27 @ d22 = t30,  d27 = t29
 
 mbutterfly  d27, d20, d0[1], d0[2], q12, q15 @ d27 = t18a, d20 = t29a
-mbutterfly  d29, d5,  d0[1], d0[2], q12, q15 @ d29 = t19,  d5  = t28
-mbutterfly  d28, d6,  d0[1], d0[2], q12, q15, neg=1 @ d28 = t27,  d6  = t20
+mbutterfly  d29, d9,  d0[1], d0[2], q12, q15 @ d29 = t19,  d9  = t28
+mbutterfly  d28, d10, d0[1], d0[2], q12, q15, neg=1 @ d28 = t27,  d10 = t20
 mbutterfly  d26, d21, d0[1], d0[2], q12, q15, neg=1 @ d26 = t26a, d21 = t21a
 
-butterfly   d31, d24, d7,  d4  @ d31 = t31,  d24 = t24
+butterfly   d31, d24, d11, d8  @ d31 = t31,  d24 = t24
 butterfly   d30, d25, d22, d23 @ d30 = t30a, d25 = t25a
 butterfly_r d23, d16, d16, d18 @ d23 = t23,  d16 = t16
 butterfly_r d22, d17, d17, d19 @ d22 = t22a, d17 = t17a
 butterfly   d18, d21, d27, d21 @ d18 = t18,  d21 = t21
-butterfly_r d27, d28, d5,  d28 @ d27 = t27a, d28 = t28a
-butterfly   d4,  d26, d20, d26 @ d4  = t29,  d26 = t26
-butterfly   d19, d20, d29, d6  @ d19 = t19a, d20 = t20
-vmov d29, d4 @ d29 = t29
-
-mbutterfly0 d27, d20, d27, d20, d4, d6, q2, q3 @ d27 = t27,  d20 = t20
-mbutterfly0 d26, d21, d26, d21, d4, d6, q2, q3 @ d26 = t26a, d21 = t21a
-mbutterfly0 d25, d22, d25, d22, d4, d6, q2, q3 @ d25 = t25,  d22 = t22
-mbutterfly0 d24, d23, d24, d23, d4, d6, q2, q3 @ d24 = t24a, d23 = t23a
+butterfly_r d27, d28, d9,  d28 @ d27 = t27a, d28 = t28a
+butterfly   d8,  d26, d20, d26 @ d8  = t29,  d26 = t26
+butterfly   d19, d20, d29, d10 @ d19 = t19a, d20 = t20
+vmov d29, d8 @ d29 = t29
+
+mbutterfly0 d27, d20, d27, d20, d8, d10, q4, q5 @ d27 = t27,  d20 = t20
+mbutterfly0 d26, d21, d26, d21, d8, d10, q4, q5 @ d26 = t26a, d21 = t21a
+mbutterfly0 d25, d22, d25, d22, d8, d10, q4, q5 @ d25 = t25,  d22 = t22
+mbutterfly0 d24, d23, d24, d23, d8, d10, q4, q5 @ d24 = t24a, d23 = t23a
 bx  lr
 .endm
 
 function idct32_odd
-movrel  r12, idct_coeffs
-add r12, r12, #32
-vld1.16 {q0-q1}, [r12,:128]
-
-mbutterfly  d16, d31, d0[0], d0[1], q2, q3 @ d16 = t16a, d31 = t31a
-mbutterfly  d24, d23, d0[2], d0[3], q2, q3 @ d24 = t17a, d23 = t30a
-mbutterfly  d20, d27, d1[0], d1[1], q2, q3 @ d20 = t18a, d27 = t29a
-mbutterfly  d28, d19, d1[2], d1[3], q2, q3 @ d28 = t19a, d19 = t28a
-mbutterfly  d18, d29, d2[0], d2[1], q2, q3 @ d18 = t20a, d29 = t27a
-mbutterfly  d26, d21, d2[2], d2[3], q2, q3 @ d26 = t21a, d21 = t26a
-

[libav-devel] [PATCH] aarch64: vp9itxfm: Avoid reloading the idct32 coefficients

2017-02-09 Thread Martin Storsjö
The idct32x32 function actually backed up and restored d8-d15 even
though it didn't clobber them; there are plenty of registers that
can be used to allow keeping all the idct coefficients in registers
without having to reload different subsets of them at different
stages in the transform.

After this, we still can skip backing up and restoring d12-d15.

Before:
vp9_inv_dct_dct_32x32_sub32_add_neon: 8128.3
After:
vp9_inv_dct_dct_32x32_sub32_add_neon: 8053.3
---
 libavcodec/aarch64/vp9itxfm_neon.S | 110 +++--
 1 file changed, 43 insertions(+), 67 deletions(-)

diff --git a/libavcodec/aarch64/vp9itxfm_neon.S b/libavcodec/aarch64/vp9itxfm_neon.S
index c954d1a..64286df 100644
--- a/libavcodec/aarch64/vp9itxfm_neon.S
+++ b/libavcodec/aarch64/vp9itxfm_neon.S
@@ -1106,18 +1106,14 @@ endfunc
 .endm
 
 function idct32_odd
-ld1 {v0.8h,v1.8h}, [x11]
-
-dmbutterfly v16, v31, v0.h[0], v0.h[1], v4, v5, v6, v7 // v16 = t16a, v31 = t31a
-dmbutterfly v24, v23, v0.h[2], v0.h[3], v4, v5, v6, v7 // v24 = t17a, v23 = t30a
-dmbutterfly v20, v27, v0.h[4], v0.h[5], v4, v5, v6, v7 // v20 = t18a, v27 = t29a
-dmbutterfly v28, v19, v0.h[6], v0.h[7], v4, v5, v6, v7 // v28 = t19a, v19 = t28a
-dmbutterfly v18, v29, v1.h[0], v1.h[1], v4, v5, v6, v7 // v18 = t20a, v29 = t27a
-dmbutterfly v26, v21, v1.h[2], v1.h[3], v4, v5, v6, v7 // v26 = t21a, v21 = t26a
-dmbutterfly v22, v25, v1.h[4], v1.h[5], v4, v5, v6, v7 // v22 = t22a, v25 = t25a
-dmbutterfly v30, v17, v1.h[6], v1.h[7], v4, v5, v6, v7 // v30 = t23a, v17 = t24a
-
-ld1 {v0.8h}, [x10]
+dmbutterfly v16, v31, v8.h[0], v8.h[1], v4, v5, v6, v7 // v16 = t16a, v31 = t31a
+dmbutterfly v24, v23, v8.h[2], v8.h[3], v4, v5, v6, v7 // v24 = t17a, v23 = t30a
+dmbutterfly v20, v27, v8.h[4], v8.h[5], v4, v5, v6, v7 // v20 = t18a, v27 = t29a
+dmbutterfly v28, v19, v8.h[6], v8.h[7], v4, v5, v6, v7 // v28 = t19a, v19 = t28a
+dmbutterfly v18, v29, v9.h[0], v9.h[1], v4, v5, v6, v7 // v18 = t20a, v29 = t27a
+dmbutterfly v26, v21, v9.h[2], v9.h[3], v4, v5, v6, v7 // v26 = t21a, v21 = t26a
+dmbutterfly v22, v25, v9.h[4], v9.h[5], v4, v5, v6, v7 // v22 = t22a, v25 = t25a
+dmbutterfly v30, v17, v9.h[6], v9.h[7], v4, v5, v6, v7 // v30 = t23a, v17 = t24a
 
 butterfly_8h v4,  v24, v16, v24 // v4  = t16, v24 = t17
 butterfly_8h v5,  v20, v28, v20 // v5  = t19, v20 = t18
@@ -1136,18 +1132,14 @@ function idct32_odd
 endfunc
 
 function idct32_odd_half
-ld1 {v0.8h,v1.8h}, [x11]
-
-dmbutterfly_h1  v16, v31, v0.h[0], v0.h[1], v4, v5, v6, v7 // v16 = t16a, v31 = t31a
-dmbutterfly_h2  v24, v23, v0.h[2], v0.h[3], v4, v5, v6, v7 // v24 = t17a, v23 = t30a
-dmbutterfly_h1  v20, v27, v0.h[4], v0.h[5], v4, v5, v6, v7 // v20 = t18a, v27 = t29a
-dmbutterfly_h2  v28, v19, v0.h[6], v0.h[7], v4, v5, v6, v7 // v28 = t19a, v19 = t28a
-dmbutterfly_h1  v18, v29, v1.h[0], v1.h[1], v4, v5, v6, v7 // v18 = t20a, v29 = t27a
-dmbutterfly_h2  v26, v21, v1.h[2], v1.h[3], v4, v5, v6, v7 // v26 = t21a, v21 = t26a
-dmbutterfly_h1  v22, v25, v1.h[4], v1.h[5], v4, v5, v6, v7 // v22 = t22a, v25 = t25a
-dmbutterfly_h2  v30, v17, v1.h[6], v1.h[7], v4, v5, v6, v7 // v30 = t23a, v17 = t24a
-
-ld1 {v0.8h}, [x10]
+dmbutterfly_h1  v16, v31, v8.h[0], v8.h[1], v4, v5, v6, v7 // v16 = t16a, v31 = t31a
+dmbutterfly_h2  v24, v23, v8.h[2], v8.h[3], v4, v5, v6, v7 // v24 = t17a, v23 = t30a
+dmbutterfly_h1  v20, v27, v8.h[4], v8.h[5], v4, v5, v6, v7 // v20 = t18a, v27 = t29a
+dmbutterfly_h2  v28, v19, v8.h[6], v8.h[7], v4, v5, v6, v7 // v28 = t19a, v19 = t28a
+dmbutterfly_h1  v18, v29, v9.h[0], v9.h[1], v4, v5, v6, v7 // v18 = t20a, v29 = t27a
+dmbutterfly_h2  v26, v21, v9.h[2], v9.h[3], v4, v5, v6, v7 // v26 = t21a, v21 = t26a
+dmbutterfly_h1  v22, v25, v9.h[4], v9.h[5], v4, v5, v6, v7 // v22 = t22a, v25 = t25a
+dmbutterfly_h2  v30, v17, v9.h[6], v9.h[7], v4, v5, v6, v7 // v30 = t23a, v17 = t24a
 
 butterfly_8h v4,  v24, v16, v24 // v4  = t16, v24 = t17
 butterfly_8h v5,  v20, v28, v20 // v5  = t19, v20 = t18
@@ -1166,18 +1158,14 @@ function idct32_odd_half
 endfunc
 
 function idct32_odd_quarter
-ld1 {v0.8h,v1.8h}, [x11]
-
-dsmull_h v4,  v5,  v16, v0.h[0]
-dsmull_h v28, v29, v19, v0.h[7]
-dsmull_h v30, v31, v16, v0.h[1]
-dsmull_h v22, v23, v17, v1.h[6]
-dsmull_h v7,  v6,  v17, v1.h[7]
-dsmull_h v26, v27, v19, v0.h[6]
-dsmull_h v20, v21, v18, v1.h[0]
-dsmull_h v24, v25, v18, v1.h[1]
-
-ld1 {v0.8h}, [x10]
+ 

Re: [libav-devel] [PATCH] hlsenc: Correctly write down all 16 bytes in hex

2017-02-09 Thread Luca Barbato
On 08/02/2017 13:42, Luca Barbato wrote:
> ---
> 
>  libavformat/hlsenc.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/libavformat/hlsenc.c b/libavformat/hlsenc.c
> index 05c9adb..3496bdd 100644
> --- a/libavformat/hlsenc.c
> +++ b/libavformat/hlsenc.c
> @@ -106,7 +106,7 @@ static int dict_set_bin(AVDictionary **dict, const char *key, uint8_t *buf)
>  {
>  char hex[33];
> 
> -ff_data_to_hex(hex, buf, sizeof(buf), 0);
> +ff_data_to_hex(hex, buf, 16, 0);
>  hex[32] = '\0';
> 
>  return av_dict_set(dict, key, hex, 0);
> --
> 2.9.2
> 

Ping.
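
The one-line change quoted above fixes a classic C pitfall: inside dict_set_bin, buf is a uint8_t * parameter, so sizeof(buf) is the size of a pointer (commonly 4 or 8), not the 16 bytes of data to be written. A minimal sketch of the difference follows. It uses a hypothetical data_to_hex helper as a stand-in rather than the real ff_data_to_hex, and the other function names are made up for illustration as well.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a hex writer: emits 2 * len lowercase
 * hex digits into out and does not write a terminator. */
static void data_to_hex(char *out, const uint8_t *src, size_t len)
{
    static const char digits[] = "0123456789abcdef";
    assert(out && src);
    for (size_t i = 0; i < len; i++) {
        out[2 * i]     = digits[src[i] >> 4];
        out[2 * i + 1] = digits[src[i] & 0x0f];
    }
}

/* What the old code effectively passed as the length: buf is a
 * pointer here, so this is sizeof(uint8_t *), whatever the caller's
 * array size was. */
static size_t size_seen_by_callee(uint8_t *buf)
{
    return sizeof(buf);
}

/* The fixed pattern: the 16-byte length is stated explicitly.
 * hex must have room for 32 digits plus the terminator. */
static void key_to_hex(char *hex, const uint8_t *key)
{
    data_to_hex(hex, key, 16);
    hex[32] = '\0';
}
```

On a 64-bit target the old call would have converted only the first 8 bytes, leaving the second half of hex[] uninitialized before the terminator was written.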


[libav-devel] [PATCH] hwcontext_dxva2: support D3D9Ex

2017-02-09 Thread wm4
D3D9Ex uses different driver paths. This helps with "headless"
configurations when no user logs in. Plain D3D9 device creation will
fail if no user is logged in, while it works with D3D9Ex.
---
 libavutil/hwcontext_dxva2.c | 117 
 1 file changed, 87 insertions(+), 30 deletions(-)

diff --git a/libavutil/hwcontext_dxva2.c b/libavutil/hwcontext_dxva2.c
index ccf03c8e9f..3790bed4b7 100644
--- a/libavutil/hwcontext_dxva2.c
+++ b/libavutil/hwcontext_dxva2.c
@@ -38,8 +38,22 @@
 #include "pixfmt.h"
 
 typedef IDirect3D9* WINAPI pDirect3DCreate9(UINT);
+typedef HRESULT WINAPI pDirect3DCreate9Ex(UINT, IDirect3D9Ex **);
 typedef HRESULT WINAPI pCreateDeviceManager9(UINT *, IDirect3DDeviceManager9 **);
 
+#define FF_D3DCREATE_FLAGS (D3DCREATE_SOFTWARE_VERTEXPROCESSING | \
+D3DCREATE_MULTITHREADED | \
+D3DCREATE_FPU_PRESERVE)
+
+static const D3DPRESENT_PARAMETERS dxva2_present_params = {
+.Windowed = TRUE,
+.BackBufferWidth  = 640,
+.BackBufferHeight = 480,
+.BackBufferCount  = 0,
+.SwapEffect   = D3DSWAPEFFECT_DISCARD,
+.Flags= D3DPRESENTFLAG_VIDEO,
+};
+
 typedef struct DXVA2Mapping {
 uint32_t palette_dummy[256];
 } DXVA2Mapping;
@@ -411,19 +425,83 @@ static void dxva2_device_free(AVHWDeviceContext *ctx)
 av_freep(&ctx->user_opaque);
 }
 
+static int dxva2_device_create9(AVHWDeviceContext *ctx, UINT adapter)
+{
+DXVA2DevicePriv *priv = ctx->user_opaque;
+D3DPRESENT_PARAMETERS d3dpp = dxva2_present_params;
+D3DDISPLAYMODE d3ddm;
+HRESULT hr;
+pDirect3DCreate9 *createD3D = (pDirect3DCreate9 *)GetProcAddress(priv->d3dlib, "Direct3DCreate9");
+if (!createD3D) {
+av_log(ctx, AV_LOG_ERROR, "Failed to locate Direct3DCreate9\n");
+return AVERROR_UNKNOWN;
+}
+
+priv->d3d9 = createD3D(D3D_SDK_VERSION);
+if (!priv->d3d9) {
+av_log(ctx, AV_LOG_ERROR, "Failed to create IDirect3D object\n");
+return AVERROR_UNKNOWN;
+}
+
+IDirect3D9_GetAdapterDisplayMode(priv->d3d9, adapter, &d3ddm);
+
+d3dpp.BackBufferFormat = d3ddm.Format;
+
+hr = IDirect3D9_CreateDevice(priv->d3d9, adapter, D3DDEVTYPE_HAL, GetShellWindow(),
+FF_D3DCREATE_FLAGS,
+&d3dpp, &priv->d3d9device);
+if (FAILED(hr)) {
+av_log(ctx, AV_LOG_ERROR, "Failed to create Direct3D device\n");
+return AVERROR_UNKNOWN;
+}
+
+return 0;
+}
+
+static int dxva2_device_create9ex(AVHWDeviceContext *ctx, UINT adapter)
+{
+DXVA2DevicePriv *priv = ctx->user_opaque;
+D3DPRESENT_PARAMETERS d3dpp = dxva2_present_params;
+D3DDISPLAYMODEEX modeex = {0};
+IDirect3D9Ex *d3d9ex = NULL;
+IDirect3DDevice9Ex *exdev = NULL;
+HRESULT hr;
+pDirect3DCreate9Ex *createD3DEx = (pDirect3DCreate9Ex *)GetProcAddress(priv->d3dlib, "Direct3DCreate9Ex");
+if (!createD3DEx)
+return AVERROR(ENOSYS);
+
+hr = createD3DEx(D3D_SDK_VERSION, &d3d9ex);
+if (FAILED(hr))
+return AVERROR_UNKNOWN;
+
+IDirect3D9Ex_GetAdapterDisplayModeEx(d3d9ex, adapter, &modeex, NULL);
+
+d3dpp.BackBufferFormat = modeex.Format;
+
+hr = IDirect3D9Ex_CreateDeviceEx(d3d9ex, adapter, D3DDEVTYPE_HAL, GetShellWindow(),
+ FF_D3DCREATE_FLAGS,
+ &d3dpp, NULL, &exdev);
+if (FAILED(hr)) {
+IDirect3D9Ex_Release(d3d9ex);
+return AVERROR_UNKNOWN;
+}
+
+av_log(ctx, AV_LOG_VERBOSE, "Using D3D9Ex device.\n");
+priv->d3d9 = (IDirect3D9 *)d3d9ex;
+priv->d3d9device = (IDirect3DDevice9 *)exdev;
+return 0;
+}
+
 static int dxva2_device_create(AVHWDeviceContext *ctx, const char *device,
AVDictionary *opts, int flags)
 {
 AVDXVA2DeviceContext *hwctx = ctx->hwctx;
 DXVA2DevicePriv *priv;
-
-pDirect3DCreate9 *createD3D = NULL;
 pCreateDeviceManager9 *createDeviceManager = NULL;
-D3DPRESENT_PARAMETERS d3dpp = {0};
-D3DDISPLAYMODEd3ddm;
 unsigned resetToken = 0;
 UINT adapter = D3DADAPTER_DEFAULT;
 HRESULT hr;
+int err;
 
 if (device)
 adapter = atoi(device);
@@ -448,11 +526,6 @@ static int dxva2_device_create(AVHWDeviceContext *ctx, const char *device,
 return AVERROR_UNKNOWN;
 }
 
-createD3D = (pDirect3DCreate9 *)GetProcAddress(priv->d3dlib, "Direct3DCreate9");
-if (!createD3D) {
-av_log(ctx, AV_LOG_ERROR, "Failed to locate Direct3DCreate9\n");
-return AVERROR_UNKNOWN;
-}
 createDeviceManager = (pCreateDeviceManager9 *)GetProcAddress(priv->dxva2lib,
 "DXVA2CreateDirect3DDeviceManager9");
 if (!createDeviceManager) {
@@ -460,27 +533,11 @@ static int dxva2_device_create(AVHWDeviceContext *ctx, const char *device,
 return AVERROR_UNKNOWN;
 }
 
-priv->d3d9 = createD3D(D3D_SDK_VERSION);
-   

Re: [libav-devel] [PATCH 5/5] aarch64: vp9itxfm: Do separate functions for half/quarter idct16 and idct32 (alternative 2)

2017-02-09 Thread Martin Storsjö

On Thu, 9 Feb 2017, Janne Grunau wrote:


On 2017-02-06 00:16:41 +0200, Martin Storsjö wrote:


Ok, so after running a slightly shorter clip (which seems to spend about as
large a percentage of its runtime doing IDCT as the previous one) with a few more
iterations, I've got the following results (the 'user' part from 'time
avconv -threads 1 -i foo -f null -'):

32 orig   32 alt1   32 alt2   64 orig   64 alt1   64 alt2
40.436s   40.148s   40.008s   37.428s   37.356s   37.192s
40.596s   40.140s   40.216s   37.572s   37.524s   37.384s
40.512s   40.228s   40.188s   37.740s   37.588s   37.368s
40.584s   40.136s   40.216s   37.880s   37.492s   37.348s
40.572s   40.292s   40.232s   37.756s   37.556s   37.676s
40.764s   40.312s   40.232s   37.876s   37.640s   37.468s
40.688s   40.284s   40.368s   37.972s   37.608s   37.460s

So while alt2 is faster in most runs, the margin is not quite as big as in
the previous benchmark. (The benchmarks were done on a practically unloaded
system so it shouldn't vary too much from run to run, but in practice, the
first few runs seem to be slightly faster than the later ones.)

I.e. around 400 ms gain out of 40 s for alt1, and then anywhere from -50 to
+150 ms of further speedup on top of that for alt2.
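
To make the comparison in the table above less dependent on eyeballing individual runs, the per-column means can be computed directly. The following is a small sketch under stated assumptions: the arrays simply copy the seven 'user' times from the table, and the names mean and gain_ms are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define RUNS 7

/* 'user' times in seconds, copied column by column from the table. */
static const double orig32[RUNS]  = { 40.436, 40.596, 40.512, 40.584, 40.572, 40.764, 40.688 };
static const double alt1_32[RUNS] = { 40.148, 40.140, 40.228, 40.136, 40.292, 40.312, 40.284 };
static const double alt2_32[RUNS] = { 40.008, 40.216, 40.188, 40.216, 40.232, 40.232, 40.368 };
static const double orig64[RUNS]  = { 37.428, 37.572, 37.740, 37.880, 37.756, 37.876, 37.972 };
static const double alt1_64[RUNS] = { 37.356, 37.524, 37.588, 37.492, 37.556, 37.640, 37.608 };
static const double alt2_64[RUNS] = { 37.192, 37.384, 37.368, 37.348, 37.676, 37.468, 37.460 };

/* Arithmetic mean of n samples. */
static double mean(const double *v, size_t n)
{
    assert(n > 0);
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += v[i];
    return sum / (double)n;
}

/* Mean speedup of 'after' relative to 'before', in milliseconds.
 * Positive means 'after' is faster. */
static double gain_ms(const double *before, const double *after, size_t n)
{
    return (mean(before, n) - mean(after, n)) * 1000.0;
}
```

For example, gain_ms(orig32, alt1_32, RUNS) comes out at roughly 373 ms, in line with the "around 400 ms" estimate, while the additional mean gain of alt2 over alt1 is much smaller on 32-bit than on 64-bit in this data.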

What do you think?


At least it looks like the difference between alt1 and alt2 is quite 
similar on 32- and 64-bit. So we should use the same variant on both 
archs. I favor alternative 2.


Ok then - I'll try to polish up and push alternative 2 based on the 
feedback I got.


// Martin

Re: [libav-devel] [PATCH 5/5] aarch64: vp9itxfm: Do a simpler half/quarter idct16/idct32 when possible (alternative 1)

2017-02-09 Thread Martin Storsjö

On Thu, 9 Feb 2017, Janne Grunau wrote:


On 2017-02-09 09:50:48 +0200, Martin Storsjö wrote:

On Thu, 9 Feb 2017, Janne Grunau wrote:

>On 2017-02-05 14:05:49 +0200, Martin Storsjö wrote:
>>On Sun, 5 Feb 2017, Janne Grunau wrote:
>>
 // out1 = in1 + in2
 // out2 = in1 - in2
 .macro butterfly_8h out1, out2, in1, in2
@@ -463,7 +510,7 @@ function idct16x16_dc_add_neon
 ret
 endfunc

-function idct16
+.macro idct16_full
 dmbutterfly0 v16, v24, v16, v24, v2, v3, v4, v5, v6, v7 // v16 = t0a,  v24 = t1a
 dmbutterfly v20, v28, v0.h[1], v0.h[2], v2, v3, v4, v5 // v20 = t2a,  v28 = t3a
 dmbutterfly v18, v30, v0.h[3], v0.h[4], v2, v3, v4, v5 // v18 = t4a,  v30 = t7a
@@ -485,7 +532,10 @@ function idct16
 dmbutterfly0 v22, v26, v22, v26, v2, v3, v18, v19, v30, v31 // v22 = t6a,  v26 = t5a
 dmbutterfly v23, v25, v0.h[1], v0.h[2], v18, v19, v30, v31 // v23 = t9a,  v25 = t14a
 dmbutterfly v27, v21, v0.h[1], v0.h[2], v18, v19, v30, v31, neg=1 // v27 = t13a, v21 = t10a
+idct16_end
>>>
>>>I think it would be clearer if idct16_end is used directly from the macro.
>>>It would probably also make sense to move idct16_end and avoid the
>>>idct16_full macro. The patch might be smaller and it is immediately
>>>obvious that there is no code change, but the resulting code is more
>>>complicated than it needs to be. The same applies to arm if we go with
>>>alternative 1.
>>
>>Ok, so you mean like this?
>>
>>function idct16
>>dmbutterfly...
>>
>>idct16_end
>>endfunc
>
>That would be one option; the other would be to move the idct_end
>instructions out of the existing idct16 function into a macro and use it
>as a macro. That would make the full idct structurally identical to the half
>and quarter versions and avoid a macro that is only used once.

I'm not really following what you're suggesting here - can you outline it
with a code sample like mine above?


Sorry, it seems I wasn't fully awake. I misread your code snippet. To 
avoid any confusion, here is what I meant, outlined as a pseudo patch:


@@
+.macro idct16_end
+[code from the existing idct16 function]
+.endm
+
function idct16
@@ ...

+idct16_end
-[code moved to the idct16_end macro]
endfunc


Right - yes, that's exactly what I meant, and what I did locally based on 
your earlier comment.


// Martin

Re: [libav-devel] [PATCH 5/5] aarch64: vp9itxfm: Do a simpler half/quarter idct16/idct32 when possible (alternative 1)

2017-02-09 Thread Janne Grunau
On 2017-02-09 09:50:48 +0200, Martin Storsjö wrote:
> On Thu, 9 Feb 2017, Janne Grunau wrote:
> 
> >On 2017-02-05 14:05:49 +0200, Martin Storsjö wrote:
> >>On Sun, 5 Feb 2017, Janne Grunau wrote:
> >>
>  // out1 = in1 + in2
>  // out2 = in1 - in2
>  .macro butterfly_8h out1, out2, in1, in2
> @@ -463,7 +510,7 @@ function idct16x16_dc_add_neon
>  ret
>  endfunc
> 
> -function idct16
> +.macro idct16_full
>  dmbutterfly0 v16, v24, v16, v24, v2, v3, v4, v5, v6, v7 // v16 = t0a,  v24 = t1a
>  dmbutterfly v20, v28, v0.h[1], v0.h[2], v2, v3, v4, v5 // v20 = t2a,  v28 = t3a
>  dmbutterfly v18, v30, v0.h[3], v0.h[4], v2, v3, v4, v5 // v18 = t4a,  v30 = t7a
> @@ -485,7 +532,10 @@ function idct16
>  dmbutterfly0 v22, v26, v22, v26, v2, v3, v18, v19, v30, v31 // v22 = t6a,  v26 = t5a
>  dmbutterfly v23, v25, v0.h[1], v0.h[2], v18, v19, v30, v31 // v23 = t9a,  v25 = t14a
>  dmbutterfly v27, v21, v0.h[1], v0.h[2], v18, v19, v30, v31, neg=1 // v27 = t13a, v21 = t10a
> +idct16_end
> >>>
> >>>I think it would be clearer if idct16_end is used directly from the macro.
> >>>It would probably also make sense to move idct16_end and avoid the
> >>>idct16_full macro. The patch might be smaller and it is immediately
> >>>obvious that there is no code change, but the resulting code is more
> >>>complicated than it needs to be. The same applies to arm if we go with
> >>>alternative 1.
> >>
> >>Ok, so you mean like this?
> >>
> >>function idct16
> >>dmbutterfly...
> >>
> >>idct16_end
> >>endfunc
> >
> >That would be one option; the other would be to move the idct_end
> >instructions out of the existing idct16 function into a macro and use it
> >as a macro. That would make the full idct structurally identical to the half
> >and quarter versions and avoid a macro that is only used once.
> 
> I'm not really following what you're suggesting here - can you outline it
> with a code sample like mine above?

Sorry, it seems I wasn't fully awake. I misread your code snippet. To 
avoid any confusion, here is what I meant, outlined as a pseudo patch:

@@
+.macro idct16_end
+[code from the existing idct16 function]
+.endm
+
 function idct16
@@ ...
 
+idct16_end
-[code moved to the idct16_end macro]
 endfunc

Janne