Re: [libav-devel] [PATCH] asfdec: fix reading files larger than 2GB

2017-02-23 Thread Luca Barbato
On 24/02/2017 01:05, John Stebbins wrote:
> avio_skip returns file position and overflows int
> ---
>  libavformat/asfdec.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/libavformat/asfdec.c b/libavformat/asfdec.c
> index 34730b2..10d3396 100644
> --- a/libavformat/asfdec.c
> +++ b/libavformat/asfdec.c
> @@ -976,7 +976,8 @@ static int asf_read_simple_index(AVFormatContext *s, 
> const GUIDParseTable *g)
>  uint64_t interval; // index entry time interval in 100 ns units, usually 
> it's 1s
>  uint32_t pkt_num, nb_entries;
>  int32_t prev_pkt_num = -1;
> -int i, ret;
> +int i;
> +int64_t ret;
>  uint64_t size = avio_rl64(pb);
>  
>  // simple index objects should be ordered by stream number, this loop 
> tries to find
> 

Sounds good. I haven't looked at the code, but it might be clearer to use
a second variable.

lu
___
libav-devel mailing list
libav-devel@libav.org
https://lists.libav.org/mailman/listinfo/libav-devel
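
The truncation this patch addresses can be sketched in isolation. In the snippet below, fake_skip is a hypothetical stand-in for a call that, like avio_skip, returns a 64-bit file position; it is not libav code, and the check assumes a platform where int is 32 bits wide.

```c
#include <limits.h>
#include <stdint.h>

/* Hypothetical stand-in for a skip call that returns the new 64-bit
 * file position (this is not the libav implementation). */
static int64_t fake_skip(int64_t pos, int64_t offset)
{
    return pos + offset;
}

/* Returns 1 if the position can be stored in an int unchanged,
 * 0 if converting it to int would alter the value. */
static int position_fits_int(int64_t pos)
{
    return pos >= INT_MIN && pos <= INT_MAX;
}
```

A position just past 2 GiB fails position_fits_int, so storing it in an int would typically produce a negative value that a "ret < 0" check would mistake for an error. Keeping the result in an int64_t, as the patch does, avoids that.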

[libav-devel] [PATCH] asfdec: fix reading files larger than 2GB

2017-02-23 Thread John Stebbins
avio_skip returns file position and overflows int
---
 libavformat/asfdec.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/libavformat/asfdec.c b/libavformat/asfdec.c
index 34730b2..10d3396 100644
--- a/libavformat/asfdec.c
+++ b/libavformat/asfdec.c
@@ -976,7 +976,8 @@ static int asf_read_simple_index(AVFormatContext *s, const 
GUIDParseTable *g)
 uint64_t interval; // index entry time interval in 100 ns units, usually 
it's 1s
 uint32_t pkt_num, nb_entries;
 int32_t prev_pkt_num = -1;
-int i, ret;
+int i;
+int64_t ret;
 uint64_t size = avio_rl64(pb);
 
 // simple index objects should be ordered by stream number, this loop 
tries to find
-- 
2.9.3


Re: [libav-devel] [PATCH 2/6] arm/aarch64: vp9lpf: Keep the comparison to E within 8 bit

2017-02-23 Thread Martin Storsjö

On Thu, 23 Feb 2017, Janne Grunau wrote:


On 2017-02-11 22:19:02 +0200, Martin Storsjö wrote:

On Fri, 10 Feb 2017, Janne Grunau wrote:

>On 2017-01-15 22:55:48 +0200, Martin Storsjö wrote:
>>The theoretical maximum value of E is 193, so we can just
>>saturate the addition to 255.
>>
>>Before:                       Cortex A7      A8      A9     A53  A53/AArch64
>>vp9_loop_filter_v_4_8_neon:       143.0   127.7   114.8    88.0         87.7
>>vp9_loop_filter_v_8_8_neon:       241.0   197.2   173.7   140.0        136.7
>>vp9_loop_filter_v_16_8_neon:      497.0   419.5   379.7   293.0        275.7
>>vp9_loop_filter_v_16_16_neon:     965.2   818.7   731.4   579.0        452.0
>>After:
>>vp9_loop_filter_v_4_8_neon:       136.0   125.7   112.6    84.0         83.0
>>vp9_loop_filter_v_8_8_neon:       234.0   195.5   171.5   136.0        133.7
>>vp9_loop_filter_v_16_8_neon:      490.0   417.5   377.7   289.0        271.0
>>vp9_loop_filter_v_16_16_neon:     951.2   814.7   732.3   571.0        446.7
>>---
>> libavcodec/aarch64/vp9lpf_neon.S | 40 
+---
>> libavcodec/arm/vp9lpf_neon.S | 11 +--
>> 2 files changed, 14 insertions(+), 37 deletions(-)
>>
>>diff --git a/libavcodec/aarch64/vp9lpf_neon.S 
b/libavcodec/aarch64/vp9lpf_neon.S
>>index 3b8e6eb..4553173 100644
>>--- a/libavcodec/aarch64/vp9lpf_neon.S
>>+++ b/libavcodec/aarch64/vp9lpf_neon.S
>>@@ -51,13 +51,6 @@
>> // see the arm version instead.
>>
>>
>>-.macro uabdl_sz dst1, dst2, in1, in2, sz
>>-uabdl   \dst1,  \in1\().8b,  \in2\().8b
>>-.ifc \sz, .16b
>>-uabdl2  \dst2,  \in1\().16b, \in2\().16b
>>-.endif
>>-.endm
>>-
>> .macro add_sz dst1, dst2, in1, in2, in3, in4, sz
>> add \dst1,  \in1,  \in3
>> .ifc \sz, .16b
>>@@ -86,20 +79,6 @@
>> .endif
>> .endm
>>
>>-.macro cmhs_sz dst1, dst2, in1, in2, in3, in4, sz
>>-cmhs    \dst1,  \in1,  \in3
>>-.ifc \sz, .16b
>>-cmhs    \dst2,  \in2,  \in4
>>-.endif
>>-.endm
>>-
>>-.macro xtn_sz dst, in1, in2, sz
>>-xtn     \dst\().8b,  \in1
>>-.ifc \sz, .16b
>>-xtn2    \dst\().16b, \in2
>>-.endif
>>-.endm
>>-
>> .macro usubl_sz dst1, dst2, in1, in2, sz
>> usubl   \dst1,  \in1\().8b,  \in2\().8b
>> .ifc \sz, .16b
>>@@ -179,20 +158,20 @@
>> // tmpq2 == tmp3 + tmp4, etc.
>> .macro loop_filter wd, sz, mix, tmp1, tmp2, tmp3, tmp4, tmp5, tmp6, tmp7, 
tmp8
>> .if \mix == 0
>>-dup     v0.8h,  w2        // E
>>-dup     v1.8h,  w2        // E
>>+dup     v0\sz,  w2        // E
>> dup     v2\sz,  w3        // I
>> dup     v3\sz,  w4        // H
>> .else
>>-dup     v0.8h,  w2        // E
>>+dup     v0.8b,  w2        // E
>> dup     v2.8b,  w3        // I
>> dup     v3.8b,  w4        // H
>>+lsr     w5, w2,  #8
>> lsr     w6, w3,  #8
>> lsr     w7, w4,  #8
>>-ushr    v1.8h,  v0.8h, #8 // E
>>+dup     v1.8b,  w5        // E
>> dup     v4.8b,  w6        // I
>>-bic     v0.8h,  #255, lsl 8 // E
>> dup     v5.8b,  w7        // H
>>+trn1    v0.2d,  v0.2d,  v1.2d
>
>isn't this equivalent to
>
>dup  v0.8h, w2
>uzp1 v0.16b, v0.16b, v0.16b
>
>on little endian?

Nice idea, but it isn't quite as straightforward on aarch64 - on arm it
would have been.


gah, yes.

All the even values will be output in the output registers of uzp1, so 
you need uzp2 as well.


So instead of this as we have now:

dup  v0.8b, w2
lsr  w5, w2, #8
dup  v1.8b, w5
trn1 v0.2d, v0.2d, v1.2d

We could do:

dup  v0.8h, w2
uzp2 v1.16b, v0.16b, v0.16b
uzp1 v0.16b, v0.16b, v0.16b
trn1 v0.2d, v0.2d, v1.2d


rev16 v1.16b, v0.16b // or ext ..x or any other instruction
uzp1  v0.16b, v0.16b, v1.16b

is one instruction fewer, but also not straightforward


Neat, thanks! This turns out to be one cycle faster in total, and three 
instructions less. I'll push that as a separate patch since it changes the 
existing ones quite a bit as well, not just the registers touched by this 
patch.


// Martin
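
The claim in the patch description, that the comparison against E can stay within 8 bits when the sum is saturated, can be checked exhaustively in C. This is only a sketch: it assumes the check has the usual shape 2*|p0-q0| + |p1-q1|/2 <= E (that formula is not quoted in this thread), and the helper names are made up for illustration.

```c
#include <stdint.h>

/* Unsigned 8-bit saturating add, the scalar analogue of a saturating
 * NEON add on bytes. */
static uint8_t sat_add_u8(unsigned a, unsigned b)
{
    unsigned s = a + b;
    return s > 255 ? 255 : (uint8_t)s;
}

/* Reference check, computed in wide arithmetic. d0 and d1 stand for the
 * absolute differences |p0-q0| and |p1-q1| (assumed shape of the test). */
static int check_wide(unsigned d0, unsigned d1, unsigned e)
{
    return d0 * 2 + (d1 >> 1) <= e;
}

/* Same check using only 8-bit values and saturating adds. */
static int check_sat8(uint8_t d0, uint8_t d1, uint8_t e)
{
    uint8_t twice = sat_add_u8(d0, d0);
    uint8_t sum   = sat_add_u8(twice, d1 >> 1);
    return sum <= e;
}

/* Count inputs where the two checks disagree, for every E in 0..max_e. */
static long count_mismatches(unsigned max_e)
{
    long bad = 0;
    for (unsigned e = 0; e <= max_e; e++)
        for (unsigned d0 = 0; d0 < 256; d0++)
            for (unsigned d1 = 0; d1 < 256; d1++)
                if (check_wide(d0, d1, e) !=
                    check_sat8((uint8_t)d0, (uint8_t)d1, (uint8_t)e))
                    bad++;
    return bad;
}
```

With max_e = 193 the two checks never disagree, because a saturated sum of 255 is still larger than any possible E. They would only diverge if E could reach 255.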

Re: [libav-devel] [PATCH 2/3] configure: Simplify dlopen check

2017-02-23 Thread Janne Grunau
On 2017-02-21 18:26:25 +0100, Diego Biurrun wrote:
> ---
> 
> This was previously approved.
> 
>  configure | 26 +-
>  1 file changed, 9 insertions(+), 17 deletions(-)
> 
> diff --git a/configure b/configure
> index 6f1be32..ef6a8e0 100755
> --- a/configure
> +++ b/configure
> @@ -1608,7 +1608,6 @@ SYSTEM_FUNCS="
>  CommandLineToArgvW
>  CoTaskMemFree
>  CryptGenRandom
> -dlopen
>  fcntl
>  flt_lim
>  fork
> @@ -2218,10 +2217,8 @@ wmv3_vaapi_hwaccel_select="vc1_vaapi_hwaccel"
>  wmv3_vdpau_hwaccel_select="vc1_vdpau_hwaccel"
>  
>  # hardware-accelerated codecs
> -nvenc_deps_any="dlopen LoadLibrary"
> -nvenc_extralibs='$ldl'
> -omx_deps="dlopen pthreads"
> -omx_extralibs='$ldl'
> +nvenc_deps_any="libdl LoadLibrary"
> +omx_deps="libdl pthreads"
>  omx_rpi_select="omx"
>  qsvdec_select="qsv"
>  qsvenc_select="qsv"
> @@ -2280,7 +2277,7 @@ mjpeg2jpeg_bsf_select="jpegtables"
>  
>  # external libraries
>  avisynth_deps="LoadLibrary"
> -avxsynth_deps="dlopen"
> +avxsynth_deps="libdl"
>  avisynth_demuxer_deps_any="avisynth avxsynth"
>  avisynth_demuxer_select="riffdec"
>  libdcadec_decoder_deps="libdcadec"
> @@ -2472,10 +2469,8 @@ deinterlace_qsv_filter_deps="libmfx"
>  deinterlace_vaapi_filter_deps="vaapi"
>  delogo_filter_deps="gpl"
>  drawtext_filter_deps="libfreetype"
> -frei0r_filter_deps="frei0r dlopen"
> -frei0r_filter_extralibs='$ldl'
> -frei0r_src_filter_deps="frei0r dlopen"
> -frei0r_src_filter_extralibs='$ldl'
> +frei0r_filter_deps="frei0r libdl"
> +frei0r_src_filter_deps="frei0r libdl"
>  hdcd_filter_deps="libhdcd"
>  hqdn3d_filter_deps="gpl"
>  interlace_filter_deps="gpl"
> @@ -4461,12 +4456,6 @@ check_code cc arm_neon.h "int16x8_t test = 
> vdupq_n_s16(0)" && enable intrinsics_
>  
>  check_ldflags -Wl,--as-needed
>  
> -if check_func dlopen; then
> -ldl=
> -elif check_func dlopen -ldl; then
> -ldl=-ldl
> -fi
> -
>  if ! disabled network; then
>  check_func getaddrinfo $network_extralibs
>  check_func inet_aton $network_extralibs
> @@ -4638,6 +4627,9 @@ enabled pthreads &&
>  disabled  zlib || check_lib  zlib  zlib.h  zlibVersion -lz
>  disabled bzlib || check_lib bzlib bzlib.h BZ2_bzlibVersion -lbz2
>  
> +# On some systems dynamic loading requires no extra linker flags
> +check_lib libdl dlfcn.h dlopen || check_lib libdl dlfcn.h dlopen -ldl
> +
>  check_lib libm math.h sin -lm && LIBM="-lm"
>  
>  atan2f_args=2
> @@ -4650,7 +4642,7 @@ done
>  
>  # these are off by default, so fail if requested and not available
>  enabled avisynth  && require_header avisynth/avisynth_c.h
> -enabled avxsynth  && require avxsynth "avxsynth/avxsynth_c.h 
> dlfcn.h" dlopen -ldl
> +enabled avxsynth  && require_header avxsynth/avxsynth_c.h
>  enabled cuda  && require cuda cuda.h cuInit -lcuda
>  enabled frei0r&& require_header frei0r.h
>  enabled gnutls&& require_pkg_config gnutls gnutls 
> gnutls/gnutls.h gnutls_global_init

ok

Janne

Re: [libav-devel] [PATCH 1/3] Revert "configure: Add proper weak dependency of drawtext filter on libfontconfig"

2017-02-23 Thread Janne Grunau
On 2017-02-21 18:26:24 +0100, Diego Biurrun wrote:
> External dependencies cannot be handled as weak dependencies since they need
> to be explicitly enabled. If a weak dependency is set, the variable 
> corresponding
> to the weak dependency can be enabled without the rest of the build system
> settings, resulting in a failing build.
> 
> This reverts commit 66988320794a107f2a460eaa71dbd9fab8056842.
> ---
>  configure | 1 -
>  1 file changed, 1 deletion(-)
> 
> diff --git a/configure b/configure
> index 24e9fc3..6f1be32 100755
> --- a/configure
> +++ b/configure
> @@ -2472,7 +2472,6 @@ deinterlace_qsv_filter_deps="libmfx"
>  deinterlace_vaapi_filter_deps="vaapi"
>  delogo_filter_deps="gpl"
>  drawtext_filter_deps="libfreetype"
> -drawtext_filter_suggest="libfontconfig"
>  frei0r_filter_deps="frei0r dlopen"
>  frei0r_filter_extralibs='$ldl'
>  frei0r_src_filter_deps="frei0r dlopen"

ok

Janne

Re: [libav-devel] [PATCH 3/4] arm: vp9itxfm: Reorder iadst16 coeffs

2017-02-23 Thread Janne Grunau
On 2017-02-09 14:33:55 +0200, Martin Storsjö wrote:
> This matches the order they are in the 16 bpp version.
> 
> There they are in this order, to make sure we access them in the
> same order they are declared, easing loading only half of the
> coefficients at a time.
> 
> This makes the 8 bpp version match the 16 bpp version better.
> ---
>  libavcodec/arm/vp9itxfm_neon.S | 12 ++--
>  1 file changed, 6 insertions(+), 6 deletions(-)
> 
> diff --git a/libavcodec/arm/vp9itxfm_neon.S b/libavcodec/arm/vp9itxfm_neon.S
> index f74d542..c8eeb76 100644
> --- a/libavcodec/arm/vp9itxfm_neon.S
> +++ b/libavcodec/arm/vp9itxfm_neon.S
> @@ -37,8 +37,8 @@ idct_coeffs:
>  endconst
>  
>  const iadst16_coeffs, align=4
> -.short  16364, 804, 15893, 3981, 14811, 7005, 13160, 9760
> -.short  11003, 12140, 8423, 14053, 5520, 15426, 2404, 16207
> +.short  16364, 804, 15893, 3981, 11003, 12140, 8423, 14053
> +.short  14811, 7005, 13160, 9760, 5520, 15426, 2404, 16207
>  endconst
>  
>  @ Do four 4x4 transposes, using q registers for the subtransposes that don't
> @@ -672,19 +672,19 @@ function iadst16
>  vld1.16 {q0-q1}, [r12,:128]
>  
>  mbutterfly_l    q3,  q2,  d31, d16, d0[1], d0[0] @ q3  = t1,   q2  = t0
> -mbutterfly_l    q5,  q4,  d23, d24, d2[1], d2[0] @ q5  = t9,   q4  = t8
> +mbutterfly_l    q5,  q4,  d23, d24, d1[1], d1[0] @ q5  = t9,   q4  = t8
>  butterfly_n     d31, d24, q3,  q5,  q6,  q5  @ d31 = t1a,  d24 = t9a
>  mbutterfly_l    q7,  q6,  d29, d18, d0[3], d0[2] @ q7  = t3,   q6  = t2
>  butterfly_n     d16, d23, q2,  q4,  q3,  q4  @ d16 = t0a,  d23 = t8a
>  
> -mbutterfly_l    q3,  q2,  d21, d26, d2[3], d2[2] @ q3  = t11,  q2  = t10
> +mbutterfly_l    q3,  q2,  d21, d26, d1[3], d1[2] @ q3  = t11,  q2  = t10
>  butterfly_n     d29, d26, q7,  q3,  q4,  q3  @ d29 = t3a,  d26 = t11a
> -mbutterfly_l    q5,  q4,  d27, d20, d1[1], d1[0] @ q5  = t5,   q4  = t4
> +mbutterfly_l    q5,  q4,  d27, d20, d2[1], d2[0] @ q5  = t5,   q4  = t4
>  butterfly_n     d18, d21, q6,  q2,  q3,  q2  @ d18 = t2a,  d21 = t10a
>  
>  mbutterfly_l    q7,  q6,  d19, d28, d3[1], d3[0] @ q7  = t13,  q6  = t12
>  butterfly_n     d20, d28, q5,  q7,  q2,  q7  @ d20 = t5a,  d28 = t13a
> -mbutterfly_l    q3,  q2,  d25, d22, d1[3], d1[2] @ q3  = t7,   q2  = t6
> +mbutterfly_l    q3,  q2,  d25, d22, d2[3], d2[2] @ q3  = t7,   q2  = t6
>  butterfly_n     d27, d19, q4,  q6,  q5,  q6  @ d27 = t4a,  d19 = t12a
>  
>  mbutterfly_l    q5,  q4,  d17, d30, d3[3], d3[2] @ q5  = t15,  q4  = t14

ok

Janne
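
The effect of the reordering can be checked with a small C sketch. The two tables are copied from the diff above. The index lists are a reading of the d-register lanes in the quoted hunk (flat index = 4 * d-register number + lane, in program order), and the function names are hypothetical.

```c
#include <stdint.h>

static const int16_t old_coeffs[16] = {
    16364, 804, 15893, 3981, 14811, 7005, 13160, 9760,
    11003, 12140, 8423, 14053, 5520, 15426, 2404, 16207
};
static const int16_t new_coeffs[16] = {
    16364, 804, 15893, 3981, 11003, 12140, 8423, 14053,
    14811, 7005, 13160, 9760, 5520, 15426, 2404, 16207
};

/* Coefficient pair used by each of the eight mbutterfly_l calls,
 * before and after the patch. */
static const int old_idx[8][2] = {
    {1, 0}, {9, 8}, {3, 2}, {11, 10}, {5, 4}, {13, 12}, {7, 6}, {15, 14}
};
static const int new_idx[8][2] = {
    {1, 0}, {5, 4}, {3, 2}, {7, 6}, {9, 8}, {13, 12}, {11, 10}, {15, 14}
};

/* Returns 1 if every butterfly still reads the same coefficient values. */
static int same_values(void)
{
    for (int i = 0; i < 8; i++)
        for (int j = 0; j < 2; j++)
            if (old_coeffs[old_idx[i][j]] != new_coeffs[new_idx[i][j]])
                return 0;
    return 1;
}

/* Returns 1 if the first four butterflies only touch the first half of
 * the given access table and the last four only touch the second half. */
static int halves_in_order(const int idx[8][2])
{
    for (int i = 0; i < 8; i++)
        for (int j = 0; j < 2; j++)
            if ((idx[i][j] >= 8) != (i >= 4))
                return 0;
    return 1;
}
```

Under this reading, same_values() holds, and only the new access pattern satisfies halves_in_order(), which is what makes loading half of the coefficients at a time straightforward.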

Re: [libav-devel] [PATCH 4/4] aarch64: vp9itxfm: Reorder iadst16 coeffs

2017-02-23 Thread Janne Grunau
On 2017-02-09 14:33:56 +0200, Martin Storsjö wrote:
> This matches the order they are in the 16 bpp version.
> 
> There they are in this order, to make sure we access them in the
> same order they are declared, easing loading only half of the
> coefficients at a time.
> 
> This makes the 8 bpp version match the 16 bpp version better.
> ---
>  libavcodec/aarch64/vp9itxfm_neon.S | 12 ++--
>  1 file changed, 6 insertions(+), 6 deletions(-)
> 
> diff --git a/libavcodec/aarch64/vp9itxfm_neon.S 
> b/libavcodec/aarch64/vp9itxfm_neon.S
> index f87f6bd..7b7dbd4 100644
> --- a/libavcodec/aarch64/vp9itxfm_neon.S
> +++ b/libavcodec/aarch64/vp9itxfm_neon.S
> @@ -37,8 +37,8 @@ idct_coeffs:
>  endconst
>  
>  const iadst16_coeffs, align=4
> -.short  16364, 804, 15893, 3981, 14811, 7005, 13160, 9760
> -.short  11003, 12140, 8423, 14053, 5520, 15426, 2404, 16207
> +.short  16364, 804, 15893, 3981, 11003, 12140, 8423, 14053
> +.short  14811, 7005, 13160, 9760, 5520, 15426, 2404, 16207
>  endconst
>  
>  // out1 = ((in1 + in2) * d0[0] + (1 << 13)) >> 14
> @@ -622,19 +622,19 @@ function iadst16
>  ld1 {v0.8h,v1.8h}, [x11]
>  
>  dmbutterfly_l   v6,  v7,  v4,  v5,  v31, v16, v0.h[1], v0.h[0]   // v6,v7   = t1,   v4,v5   = t0
> -dmbutterfly_l   v10, v11, v8,  v9,  v23, v24, v1.h[1], v1.h[0]   // v10,v11 = t9,   v8,v9   = t8
> +dmbutterfly_l   v10, v11, v8,  v9,  v23, v24, v0.h[5], v0.h[4]   // v10,v11 = t9,   v8,v9   = t8
>  dbutterfly_n    v31, v24, v6,  v7,  v10, v11, v12, v13, v10, v11 // v31 = t1a,  v24 = t9a
>  dmbutterfly_l   v14, v15, v12, v13, v29, v18, v0.h[3], v0.h[2]   // v14,v15 = t3,   v12,v13 = t2
>  dbutterfly_n    v16, v23, v4,  v5,  v8,  v9,  v6,  v7,  v8,  v9  // v16 = t0a,  v23 = t8a
>  
> -dmbutterfly_l   v6,  v7,  v4,  v5,  v21, v26, v1.h[3], v1.h[2]   // v6,v7   = t11,  v4,v5   = t10
> +dmbutterfly_l   v6,  v7,  v4,  v5,  v21, v26, v0.h[7], v0.h[6]   // v6,v7   = t11,  v4,v5   = t10
>  dbutterfly_n    v29, v26, v14, v15, v6,  v7,  v8,  v9,  v6,  v7  // v29 = t3a,  v26 = t11a
> -dmbutterfly_l   v10, v11, v8,  v9,  v27, v20, v0.h[5], v0.h[4]   // v10,v11 = t5,   v8,v9   = t4
> +dmbutterfly_l   v10, v11, v8,  v9,  v27, v20, v1.h[1], v1.h[0]   // v10,v11 = t5,   v8,v9   = t4
>  dbutterfly_n    v18, v21, v12, v13, v4,  v5,  v6,  v7,  v4,  v5  // v18 = t2a,  v21 = t10a
>  
>  dmbutterfly_l   v14, v15, v12, v13, v19, v28, v1.h[5], v1.h[4]   // v14,v15 = t13,  v12,v13 = t12
>  dbutterfly_n    v20, v28, v10, v11, v14, v15, v4,  v5,  v14, v15 // v20 = t5a,  v28 = t13a
> -dmbutterfly_l   v6,  v7,  v4,  v5,  v25, v22, v0.h[7], v0.h[6]   // v6,v7   = t7,   v4,v5   = t6
> +dmbutterfly_l   v6,  v7,  v4,  v5,  v25, v22, v1.h[3], v1.h[2]   // v6,v7   = t7,   v4,v5   = t6
>  dbutterfly_n    v27, v19, v8,  v9,  v12, v13, v10, v11, v12, v13 // v27 = t4a,  v19 = t12a
>  
>  dmbutterfly_l   v10, v11, v8,  v9,  v17, v30, v1.h[7], v1.h[6]   // v10,v11 = t15,  v8,v9   = t14

ok

Janne

Re: [libav-devel] [PATCH 2/4] aarch64: vp9itxfm: Reorder the idct coefficients for better pairing

2017-02-23 Thread Janne Grunau
On 2017-02-09 14:33:54 +0200, Martin Storsjö wrote:
> All elements are used pairwise, except for the first one.
> Previously, the 16th element was unused. Move the unused element
> to the second slot, to make the later element pairs not split
> across registers.
> 
> This simplifies loading only parts of the coefficients,
> reducing the difference to the 16 bpp version.
> ---
>  libavcodec/aarch64/vp9itxfm_neon.S | 124 
> ++---
>  1 file changed, 62 insertions(+), 62 deletions(-)

ok

Janne
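
The pairing argument in the commit message can be illustrated by counting how many coefficient pairs straddle a register boundary in each layout. This is a sketch with a hypothetical helper, working on slot indices only, since the coefficient values are not quoted here. It assumes 16 slots holding one single element plus seven pairs, with 8 halfwords per 128-bit register or 4 per 64-bit register.

```c
/* Count consecutive pairs that straddle a register boundary.
 * first_pair: slot index of the first paired element
 * n_pairs:    number of consecutive pairs
 * reg_elems:  16-bit elements per register (8 for 128 bit, 4 for 64 bit) */
static int split_pairs(int first_pair, int n_pairs, int reg_elems)
{
    int split = 0;
    for (int i = 0; i < n_pairs; i++) {
        int lo = first_pair + 2 * i;          /* first slot of this pair */
        if (lo / reg_elems != (lo + 1) / reg_elems)
            split++;                          /* halves land in different registers */
    }
    return split;
}
```

With the old layout (pairs starting at slot 1, slot 15 unused), split_pairs(1, 7, 8) reports one split pair and split_pairs(1, 7, 4) reports three. With the unused element moved to slot 1 (pairs starting at slot 2), both counts drop to zero.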

Re: [libav-devel] [PATCH 1/4] arm: vp9itxfm: Reorder the idct coefficients for better pairing

2017-02-23 Thread Janne Grunau
On 2017-02-09 14:33:53 +0200, Martin Storsjö wrote:
> All elements are used pairwise, except for the first one.
> Previously, the 16th element was unused. Move the unused element
> to the second slot, to make the later element pairs not split
> across registers.
> 
> This simplifies loading only parts of the coefficients,
> reducing the difference to the 16 bpp version.
> ---
> The 16 bpp version is only in ffmpeg for now, since libav's vp9
> decoder doesn't support the high bitdepth profiles. This change
> in itself still makes sense to do though.
> ---
>  libavcodec/arm/vp9itxfm_neon.S | 124 
> -
>  1 file changed, 62 insertions(+), 62 deletions(-)

ok

Janne

Re: [libav-devel] [PATCH] arm: vp9itxfm: Avoid reloading the idct32 coefficients

2017-02-23 Thread Janne Grunau
On 2017-02-09 13:39:55 +0200, Martin Storsjö wrote:
> The idct32x32 function actually backed up and restored q4-q7 even
> though it didn't clobber them; there are plenty of registers that
> can be used to allow keeping all the idct coefficients in registers
> without having to reload different subsets of them at different
> stages in the transform.
> 
> Since the idct16 core transform avoids clobbering q4-q7 (but clobbers
> q2-q3 instead, to avoid needing to back up and restore q4-q7 at all
> in the idct16 function), and the lanewise vmul needs a register in
> the q0-q3 range, we move the stored coefficients from q2-q3 into q4-q5
> while doing idct16.
> 
> While keeping these coefficients in registers, we still can skip backing
> up and restoring q7.
> 
> Before:  Cortex A7   A8   A9  A53
> vp9_inv_dct_dct_32x32_sub32_add_neon:  18553.8  17182.7  14303.3  12089.7
> After:
> vp9_inv_dct_dct_32x32_sub32_add_neon:  18470.3  16717.7  14173.6  11860.8
> ---
>  libavcodec/arm/vp9itxfm_neon.S | 246 
> -
>  1 file changed, 120 insertions(+), 126 deletions(-)

ok

Janne

Re: [libav-devel] [PATCH 6/6] arm: vp9lpf: Implement the mix2_44 function with one single filter pass

2017-02-23 Thread Janne Grunau
On 2017-02-11 23:42:05 +0200, Martin Storsjö wrote:
> On Sat, 11 Feb 2017, Martin Storsjö wrote:
> 
> >On Fri, 10 Feb 2017, Janne Grunau wrote:
> >
> >>On 2017-01-15 22:55:52 +0200, Martin Storsjö wrote:
> >>>For this case, with 8 inputs but only changing 4 of them, we can fit
> >>>all 16 input pixels into a q register, and still have enough temporary
> >>>registers for doing the loop filter.
> >>>
> >>>The wd=8 filters would require too many temporary registers for
> >>>processing all 16 pixels at once though.
> >>>
> >>>Before:  Cortex A7  A8 A9 A53
> >>>vp9_loop_filter_mix2_v_44_16_neon:   289.7   256.2  237.5   181.2
> >>>After:
> >>>vp9_loop_filter_mix2_v_44_16_neon:   221.2   150.5  177.7   138.0
> >>>---
> >>> libavcodec/arm/vp9dsp_init_arm.c |   7 +-
> >>> libavcodec/arm/vp9lpf_neon.S | 191
> >+++
> >>> 2 files changed, 195 insertions(+), 3 deletions(-)
> >>>
> >>>diff --git a/libavcodec/arm/vp9dsp_init_arm.c
> >b/libavcodec/arm/vp9dsp_init_arm.c
> >>>index e99d931..1ede170 100644
> >>>--- a/libavcodec/arm/vp9dsp_init_arm.c
> >>>+++ b/libavcodec/arm/vp9dsp_init_arm.c
> >>>@@ -194,6 +194,8 @@ define_loop_filters(8, 8);
> >>> define_loop_filters(16, 8);
> >>> define_loop_filters(16, 16);
> >>>
> >>>+define_loop_filters(44, 16);
> >>>+
> >>> #define lf_mix_fn(dir, wd1, wd2, stridea)
> >\
> >>> static void loop_filter_##dir##_##wd1##wd2##_16_neon(uint8_t *dst,
> >\
> >>>  ptrdiff_t
> >>>stride,
> >\
> >>>@@ -207,7 +209,6 @@ static void
> >loop_filter_##dir##_##wd1##wd2##_16_neon(uint8_t *dst,
> >>> lf_mix_fn(h, wd1, wd2, stride) \
> >>> lf_mix_fn(v, wd1, wd2, sizeof(uint8_t))
> >>>
> >>>-lf_mix_fns(4, 4)
> >>> lf_mix_fns(4, 8)
> >>> lf_mix_fns(8, 4)
> >>> lf_mix_fns(8, 8)
> >>>@@ -227,8 +228,8 @@ static av_cold void
> >vp9dsp_loopfilter_init_arm(VP9DSPContext *dsp)
> >>> dsp->loop_filter_16[0] = ff_vp9_loop_filter_h_16_16_neon;
> >>> dsp->loop_filter_16[1] = ff_vp9_loop_filter_v_16_16_neon;
> >>>
> >>>-dsp->loop_filter_mix2[0][0][0] = loop_filter_h_44_16_neon;
> >>>-dsp->loop_filter_mix2[0][0][1] = loop_filter_v_44_16_neon;
> >>>+dsp->loop_filter_mix2[0][0][0] = ff_vp9_loop_filter_h_44_16_neon;
> >>>+dsp->loop_filter_mix2[0][0][1] = ff_vp9_loop_filter_v_44_16_neon;
> >>> dsp->loop_filter_mix2[0][1][0] = loop_filter_h_48_16_neon;
> >>> dsp->loop_filter_mix2[0][1][1] = loop_filter_v_48_16_neon;
> >>> dsp->loop_filter_mix2[1][0][0] = loop_filter_h_84_16_neon;
> >>>diff --git a/libavcodec/arm/vp9lpf_neon.S b/libavcodec/arm/vp9lpf_neon.S
> >>>index e31c807..12984a9 100644
> >>>--- a/libavcodec/arm/vp9lpf_neon.S
> >>>+++ b/libavcodec/arm/vp9lpf_neon.S
> >>>@@ -44,6 +44,109 @@
> >>> vtrn.8  \r2,  \r3
> >>> .endm
> >>>
> >>>+@ The input to and output from this macro is in the registers q8-q15,
> >>>+@ and q0-q7 are used as scratch registers.
> >>>+@ p3 = q8, p0 = q11, q0 = q12, q3 = q15
> >>>+.macro loop_filter_q
> >>>+vdup.u8 d0,  r2  @ E
> >>>+lsr r2,  r2,  #8
> >>>+vdup.u8 d2,  r3  @ I
> >>>+lsr r3,  r3,  #8
> >>>+vdup.u8 d1,  r2  @ E
> >>>+vdup.u8 d3,  r3  @ I
> >
> >I tried implementing your suggestion with uzp here, but it ended up being
> >slower actually. With the version of the patch I posted here:
> >
> >vp9_loop_filter_mix2_v_44_16_neon:   221.2   150.5  185.0   139.0
> >
> >With this block replaced with this:
> >
> >vdup.u16q0,  r2  @ E
> >vdup.u16q1,  r3  @ I
> >vuzp.u8 d0,  d1  @ E
> >vuzp.u8 d2,  d3  @ I
> >
> >I get the following:
> >
> >vp9_loop_filter_mix2_v_44_16_neon:   223.2   150.5  186.1   142.0
> >
> >I.e. 1-3 cycles slower on A7, A9 and A53, identical on A8.
> 
> If I move the two vuzp further down, I get the following:
> 
> vp9_loop_filter_mix2_v_44_16_neon:   223.2   148.5  185.1   141.0
> 
> I.e. +2 on A7, -2 on A8, 0 on A9, +2 on A53. So on average it's still worse,
> even though it codewise is neater.

leave it as it was then

Janne

Re: [libav-devel] [PATCH 2/6] arm/aarch64: vp9lpf: Keep the comparison to E within 8 bit

2017-02-23 Thread Janne Grunau
On 2017-02-11 22:19:02 +0200, Martin Storsjö wrote:
> On Fri, 10 Feb 2017, Janne Grunau wrote:
> 
> >On 2017-01-15 22:55:48 +0200, Martin Storsjö wrote:
> >>The theoretical maximum value of E is 193, so we can just
> >>saturate the addition to 255.
> >>
> >>Before:                       Cortex A7      A8      A9     A53  A53/AArch64
> >>vp9_loop_filter_v_4_8_neon:       143.0   127.7   114.8    88.0         87.7
> >>vp9_loop_filter_v_8_8_neon:       241.0   197.2   173.7   140.0        136.7
> >>vp9_loop_filter_v_16_8_neon:      497.0   419.5   379.7   293.0        275.7
> >>vp9_loop_filter_v_16_16_neon:     965.2   818.7   731.4   579.0        452.0
> >>After:
> >>vp9_loop_filter_v_4_8_neon:       136.0   125.7   112.6    84.0         83.0
> >>vp9_loop_filter_v_8_8_neon:       234.0   195.5   171.5   136.0        133.7
> >>vp9_loop_filter_v_16_8_neon:      490.0   417.5   377.7   289.0        271.0
> >>vp9_loop_filter_v_16_16_neon:     951.2   814.7   732.3   571.0        446.7
> >>---
> >> libavcodec/aarch64/vp9lpf_neon.S | 40 
> >> +---
> >> libavcodec/arm/vp9lpf_neon.S | 11 +--
> >> 2 files changed, 14 insertions(+), 37 deletions(-)
> >>
> >>diff --git a/libavcodec/aarch64/vp9lpf_neon.S 
> >>b/libavcodec/aarch64/vp9lpf_neon.S
> >>index 3b8e6eb..4553173 100644
> >>--- a/libavcodec/aarch64/vp9lpf_neon.S
> >>+++ b/libavcodec/aarch64/vp9lpf_neon.S
> >>@@ -51,13 +51,6 @@
> >> // see the arm version instead.
> >>
> >>
> >>-.macro uabdl_sz dst1, dst2, in1, in2, sz
> >>-uabdl   \dst1,  \in1\().8b,  \in2\().8b
> >>-.ifc \sz, .16b
> >>-uabdl2  \dst2,  \in1\().16b, \in2\().16b
> >>-.endif
> >>-.endm
> >>-
> >> .macro add_sz dst1, dst2, in1, in2, in3, in4, sz
> >> add \dst1,  \in1,  \in3
> >> .ifc \sz, .16b
> >>@@ -86,20 +79,6 @@
> >> .endif
> >> .endm
> >>
> >>-.macro cmhs_sz dst1, dst2, in1, in2, in3, in4, sz
> >>-cmhs    \dst1,  \in1,  \in3
> >>-.ifc \sz, .16b
> >>-cmhs    \dst2,  \in2,  \in4
> >>-.endif
> >>-.endm
> >>-
> >>-.macro xtn_sz dst, in1, in2, sz
> >>-xtn     \dst\().8b,  \in1
> >>-.ifc \sz, .16b
> >>-xtn2    \dst\().16b, \in2
> >>-.endif
> >>-.endm
> >>-
> >> .macro usubl_sz dst1, dst2, in1, in2, sz
> >> usubl   \dst1,  \in1\().8b,  \in2\().8b
> >> .ifc \sz, .16b
> >>@@ -179,20 +158,20 @@
> >> // tmpq2 == tmp3 + tmp4, etc.
> >> .macro loop_filter wd, sz, mix, tmp1, tmp2, tmp3, tmp4, tmp5, tmp6, tmp7, 
> >> tmp8
> >> .if \mix == 0
> >>-dup     v0.8h,  w2        // E
> >>-dup     v1.8h,  w2        // E
> >>+dup     v0\sz,  w2        // E
> >> dup     v2\sz,  w3        // I
> >> dup     v3\sz,  w4        // H
> >> .else
> >>-dup     v0.8h,  w2        // E
> >>+dup     v0.8b,  w2        // E
> >> dup     v2.8b,  w3        // I
> >> dup     v3.8b,  w4        // H
> >>+lsr     w5, w2,  #8
> >> lsr     w6, w3,  #8
> >> lsr     w7, w4,  #8
> >>-ushr    v1.8h,  v0.8h, #8 // E
> >>+dup     v1.8b,  w5        // E
> >> dup     v4.8b,  w6        // I
> >>-bic     v0.8h,  #255, lsl 8 // E
> >> dup     v5.8b,  w7        // H
> >>+trn1    v0.2d,  v0.2d,  v1.2d
> >
> >isn't this equivalent to
> >
> >dup  v0.8h, w2
> >uzp1 v0.16b, v0.16b, v0.16b
> >
> >on little endian?
> 
> Nice idea, but it isn't quite as straightforward on aarch64 - on arm it
> would have been.

gah, yes.

> All the even values will be output in the output registers of uzp1, so 
> you need uzp2 as well.
> 
> So instead of this as we have now:
> 
> dup  v0.8b, w2
> lsr  w5, w2, #8
> dup  v1.8b, w5
> trn1 v0.2d, v0.2d, v1.2d
> 
> We could do:
> 
> dup  v0.8h, w2
> uzp2 v1.16b, v0.16b, v0.16b
> uzp1 v0.16b, v0.16b, v0.16b
> trn1 v0.2d, v0.2d, v1.2d

rev16 v1.16b, v0.16b // or ext ..x or any other instruction
uzp1  v0.16b, v0.16b, v1.16b

is one instruction fewer, but also not straightforward

ok as is

Janne
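
The byte shuffles discussed above can be emulated in plain C to see that the rev16 + uzp1 variant produces the same register contents as the dup/lsr/dup/trn1 sequence. This is a scalar model assuming little-endian lane numbering; the function names are hypothetical and only mirror the instruction names.

```c
#include <stdint.h>
#include <string.h>

/* dup vd.8h, w: replicate the low 16 bits of w into all eight halfword
 * lanes, stored here as 16 bytes in little-endian order. */
static void dup_8h(uint8_t v[16], uint32_t w)
{
    for (int i = 0; i < 8; i++) {
        v[2 * i]     = (uint8_t)(w & 0xff);
        v[2 * i + 1] = (uint8_t)((w >> 8) & 0xff);
    }
}

/* rev16 vd.16b, vn.16b: swap the two bytes inside each halfword. */
static void rev16_16b(uint8_t d[16], const uint8_t n[16])
{
    for (int i = 0; i < 8; i++) {
        d[2 * i]     = n[2 * i + 1];
        d[2 * i + 1] = n[2 * i];
    }
}

/* uzp1 vd.16b, vn.16b, vm.16b: even-numbered bytes of n, then of m. */
static void uzp1_16b(uint8_t d[16], const uint8_t n[16], const uint8_t m[16])
{
    uint8_t t[16];
    for (int i = 0; i < 8; i++) {
        t[i]     = n[2 * i];
        t[8 + i] = m[2 * i];
    }
    memcpy(d, t, 16);   /* temporary buffer so d may alias n or m */
}

/* Result of dup v0.8b / lsr / dup v1.8b / trn1 v0.2d: the low byte of w
 * in bytes 0-7 and the second byte of w in bytes 8-15. */
static void reference_layout(uint8_t v[16], uint32_t w)
{
    memset(v,     (int)(w & 0xff),        8);
    memset(v + 8, (int)((w >> 8) & 0xff), 8);
}

/* Returns 1 if both sequences give the same 16 bytes for this w. */
static int sequences_agree(uint32_t w)
{
    uint8_t v0[16], v1[16], ref[16];
    dup_8h(v0, w);
    rev16_16b(v1, v0);
    uzp1_16b(v0, v0, v1);
    reference_layout(ref, w);
    return memcmp(v0, ref, 16) == 0;
}
```

In this model the two sequences agree for any w, since both only look at the low 16 bits of the packed value.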

Re: [libav-devel] [PATCH 1/3] movenc: Add an option for enabling negative CTS offsets

2017-02-23 Thread Martin Storsjö

On Thu, 23 Feb 2017, Yusuke Nakamura wrote:


2017-02-20 6:22 GMT+09:00 Martin Storsjö :


This reduces the need for an edit list; streams that start with
e.g. dts=-1, pts=0 can be encoded as dts=0, pts=0 (which is valid
in mov/mp4) by shifting the dts values of all packets forward.
This avoids the need for edit lists for such streams (while they
still are needed for audio streams with encoder delay).
---
 libavformat/movenc.c | 24 
 libavformat/movenc.h |  2 ++
 2 files changed, 22 insertions(+), 4 deletions(-)

diff --git a/libavformat/movenc.c b/libavformat/movenc.c
index 840190d..713c145 100644
--- a/libavformat/movenc.c
+++ b/libavformat/movenc.c
@@ -62,6 +62,7 @@ static const AVOption options[] = {
 { "delay_moov", "Delay writing the initial moov until the first
fragment is cut, or until the first fragment flush", 0, AV_OPT_TYPE_CONST,
{.i64 = FF_MOV_FLAG_DELAY_MOOV}, INT_MIN, INT_MAX,
AV_OPT_FLAG_ENCODING_PARAM, "movflags" },
 { "global_sidx", "Write a global sidx index at the start of the
file", 0, AV_OPT_TYPE_CONST, {.i64 = FF_MOV_FLAG_GLOBAL_SIDX}, INT_MIN,
INT_MAX, AV_OPT_FLAG_ENCODING_PARAM, "movflags" },
 { "skip_trailer", "Skip writing the mfra/tfra/mfro trailer for
fragmented files", 0, AV_OPT_TYPE_CONST, {.i64 = FF_MOV_FLAG_SKIP_TRAILER},
INT_MIN, INT_MAX, AV_OPT_FLAG_ENCODING_PARAM, "movflags" },
+{ "negative_cts_offsets", "Use negative CTS offsets (reducing the
need for edit lists)", 0, AV_OPT_TYPE_CONST, {.i64 =
FF_MOV_FLAG_NEGATIVE_CTS_OFFSETS}, INT_MIN, INT_MAX,
AV_OPT_FLAG_ENCODING_PARAM, "movflags" },
 FF_RTP_FLAG_OPTS(MOVMuxContext, rtp_flags),
 { "skip_iods", "Skip writing iods atom.", offsetof(MOVMuxContext,
iods_skip), AV_OPT_TYPE_INT, {.i64 = 0}, 0, 1, AV_OPT_FLAG_ENCODING_PARAM},
 { "iods_audio_profile", "iods audio profile atom.",
offsetof(MOVMuxContext, iods_audio_profile), AV_OPT_TYPE_INT, {.i64 = -1},
-1, 255, AV_OPT_FLAG_ENCODING_PARAM},
@@ -1163,8 +1164,9 @@ static int mov_write_stsd_tag(AVFormatContext *s,
AVIOContext *pb, MOVTrack *tra
 return update_size(pb, pos);
 }

-static int mov_write_ctts_tag(AVIOContext *pb, MOVTrack *track)
+static int mov_write_ctts_tag(AVFormatContext *s, AVIOContext *pb,
MOVTrack *track)
 {
+MOVMuxContext *mov = s->priv_data;
 MOVStts *ctts_entries;
 uint32_t entries = 0;
 uint32_t atom_size;
@@ -1188,7 +1190,11 @@ static int mov_write_ctts_tag(AVIOContext *pb,
MOVTrack *track)
 atom_size = 16 + (entries * 8);
 avio_wb32(pb, atom_size); /* size */
 ffio_wfourcc(pb, "ctts");
-avio_wb32(pb, 0); /* version & flags */
+if (mov->flags & FF_MOV_FLAG_NEGATIVE_CTS_OFFSETS)
+avio_w8(pb, 1); /* version */



ctts ver. 1 is defined in iso4 or later isobmff brands.


Thanks, will change so that we declare iso4 as major brand if this flag is 
set (unless some other option is set that requires declaring iso5).



+else
+avio_w8(pb, 0); /* version */
+avio_wb24(pb, 0); /* flags */
 avio_wb32(pb, entries); /* entry count */
 for (i = 0; i < entries; i++) {
 avio_wb32(pb, ctts_entries[i].count);
@@ -1273,7 +1279,7 @@ static int mov_write_stbl_tag(AVFormatContext *s,
AVIOContext *pb, MOVTrack *tra
 mov_write_stss_tag(pb, track, MOV_PARTIAL_SYNC_SAMPLE);
 if (track->par->codec_type == AVMEDIA_TYPE_VIDEO &&
 track->flags & MOV_TRACK_CTTS && track->entry)
-mov_write_ctts_tag(pb, track);
+mov_write_ctts_tag(s, pb, track);
 mov_write_stsc_tag(pb, track);
 mov_write_stsz_tag(pb, track);
 mov_write_stco_tag(pb, track);
@@ -2594,7 +2600,10 @@ static int mov_write_trun_tag(AVIOContext *pb,
MOVMuxContext *mov,

 avio_wb32(pb, 0); /* size placeholder */
 ffio_wfourcc(pb, "trun");
-avio_w8(pb, 0); /* version */
+if (mov->flags & FF_MOV_FLAG_NEGATIVE_CTS_OFFSETS)
+avio_w8(pb, 1); /* version */
+else
+avio_w8(pb, 0); /* version */
 avio_wb24(pb, flags);

 avio_wb32(pb, end - first); /* sample count */
@@ -3729,6 +3738,12 @@ static int mov_write_packet(AVFormatContext *s,
AVPacket *pkt)
 mov->flags &= ~FF_MOV_FLAG_FRAG_DISCONT;
 }

+if (mov->flags & FF_MOV_FLAG_NEGATIVE_CTS_OFFSETS) {
+if (trk->dts_shift == AV_NOPTS_VALUE)
+trk->dts_shift = pkt->pts - pkt->dts;



Do you care about the issue of negative composition time offsets on an early
flush of movie fragments? Reordering of leading samples could confuse
demuxers, since the first sample has a non-zero CTS and subsequent samples
are not examined. This can occur when remuxing starts from an Open-GOP
boundary (also, don't forget that AVC and HEVC can output P or B pictures
before an IDR picture).


Good point - I hadn't thought about that. In those cases, we won't get 
exactly the desired result. On the other hand, I don't have any better 
idea on heuristics that would do the right thing either. So I'd declare 
that as a known 
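
The dts shift described in the commit message can be sketched with plain integers. This is an illustration only, with hypothetical helper names; it mirrors the "dts_shift = pts - dts" line from the patch for the first packet and assumes the shift is then added to every packet's dts.

```c
#include <stdint.h>

/* Shift derived from the first packet, as in the patch:
 * dts_shift = pkt->pts - pkt->dts. */
static int64_t first_packet_shift(int64_t dts, int64_t pts)
{
    return pts - dts;
}

/* dts after moving all packets forward by the shift. */
static int64_t shifted_dts(int64_t dts, int64_t shift)
{
    return dts + shift;
}

/* Composition time offset relative to the shifted dts; this can become
 * negative, which is why a signed (version 1) ctts/trun is needed. */
static int64_t cts_offset(int64_t dts, int64_t pts, int64_t shift)
{
    return pts - shifted_dts(dts, shift);
}
```

For a stream with (dts, pts) of (-1, 0), (0, 2), (1, 1), the shift is 1, the shifted dts values are 0, 1, 2, and the offsets are 0, 1 and -1. The first sample then starts at dts=0, pts=0 as in the commit message, at the cost of a negative offset on the reordered frame.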

Re: [libav-devel] [PATCH 3/3] Add Apple Pixlet decoder

2017-02-23 Thread Diego Biurrun
On Wed, Feb 22, 2017 at 12:53:35PM -0500, Vittorio Giovara wrote:
> --- /dev/null
> +++ b/libavcodec/pixlet.c
> @@ -0,0 +1,689 @@
> +static int read_high_coeffs(AVCodecContext *avctx, uint8_t *src, int16_t 
> *dst,
> +int size, int64_t c, int a, int64_t d,
> +int width, ptrdiff_t stride)
> +{
> +PixletContext *ctx = avctx->priv_data;
> +BitstreamContext *bc = &ctx->bc;
> +unsigned cnt1, shbits, rlen, nbits, length, i = 0, j = 0, k;
> +int ret, escape, pfx, value, yflag, xflag, flag = 0;
> +int64_t state = 3, tmp;
> +
> +while (i < size) {
> +if (state >> 8 != -3) {
> +value = ff_clz((state >> 8) + 3) ^ 0x1F;
> +} else {
> +value = -1;
> +}

nit: pointless ()

> +cnt1 = get_unary(bc, 0, length);
> +if (cnt1 >= length) {
> +cnt1 = bitstream_read(bc, nbits);
> +} else {
> +pfx = 14 + ((((uint64_t) (value - 14)) >> 32) & (value - 14));

Maybe just make value uint64_t instead of casting?

> +static int read_highpass(AVCodecContext *avctx, uint8_t *ptr,
> + int plane, AVFrame *frame)
> +{
> +for (i = 0; i < ctx->levels * 3; i++) {
> +uint32_t magic = bytestream2_get_be32(&ctx->gb);
> +
> +if (magic != PIXLET_MAGIC) {
> +av_log(avctx, AV_LOG_ERROR,
> +   "wrong magic number: 0x%08X for plane %d, band %d\n",
> +   magic, plane, i);

magic is uint32_t, use the correct C99 printf conversion specifier.

> +static int pixlet_decode_frame(AVCodecContext *avctx, void *data,
> +   int *got_frame, AVPacket *avpkt)
> +{
> +uint32_t pktsize;
> +
> +pktsize = bytestream2_get_be32(&ctx->gb);
> +if (pktsize <= 44 || pktsize - 4 > bytestream2_get_bytes_left(&ctx->gb)) {
> +av_log(avctx, AV_LOG_ERROR, "Invalid packet size %u.\n", pktsize);

same

Diego
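
For reference, the C99 way to print a uint32_t uses the macros from inttypes.h. The sketch below formats the two messages from the patch into a buffer with snprintf instead of av_log, so it can stand alone; the function names are hypothetical.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Format the "wrong magic" message with PRIX32 instead of a bare %X. */
static int format_magic(char *buf, size_t size,
                        uint32_t magic, int plane, int band)
{
    return snprintf(buf, size,
                    "wrong magic number: 0x%08" PRIX32
                    " for plane %d, band %d\n",
                    magic, plane, band);
}

/* Format the packet size message with PRIu32 instead of a bare %u. */
static int format_pktsize(char *buf, size_t size, uint32_t pktsize)
{
    return snprintf(buf, size, "Invalid packet size %" PRIu32 ".\n", pktsize);
}
```

The macros expand to whichever length modifier matches uint32_t on the target, so the format string stays correct even where uint32_t is not unsigned int.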

Re: [libav-devel] [PATCH 3/3] Add Apple Pixlet decoder

2017-02-23 Thread Luca Barbato
On 22/02/2017 18:53, Vittorio Giovara wrote:
> +/* elenril reads this as if (cthulhu->state == fhtagn) */
> +if ((a >= 0) + (a ^ (a >> 31)) - (a >> 31) != 1) {
> +nbits = 33 - ff_clz((a >= 0) + (a ^ (a >> 31)) - (a >> 31) - 1);
> +if (nbits > 16)
> +return AVERROR_INVALIDDATA;
> +} else {
> +nbits = 1;
> +}


cthulu = (a >= 0) + (a ^ (a >> 31)) - (a >> 31);
if (cthulu != 1) {
nbits = 33 - ff_clz(cthulu - 1);

...


The rest looks fine.

lu
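
As a cross-check of what the repeated expression computes, here it is next to a plain formulation. This is a sketch with hypothetical function names; it assumes that right-shifting a negative int is an arithmetic shift, which is implementation-defined in C but true of the usual compilers.

```c
#include <stdint.h>

/* The expression from the patch, factored out once as suggested above.
 * Assumes >> on a negative value is an arithmetic shift. */
static int32_t magnitude_expr(int32_t a)
{
    return (a >= 0) + (a ^ (a >> 31)) - (a >> 31);
}

/* Plain equivalent: |a|, plus one when a is non-negative.
 * (Valid away from the extremes of the int32_t range.) */
static int32_t magnitude_plain(int32_t a)
{
    return a >= 0 ? a + 1 : -a;
}
```

Read this way, the "!= 1" test in the patch is true for every a except 0 and -1.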

Re: [libav-devel] [PATCH 2/3] libavutil: add av_mod_uintp2

2017-02-23 Thread Luca Barbato
On 22/02/2017 18:53, Vittorio Giovara wrote:
> From: James Almer 
> 
> Signed-off-by: James Almer 
> ---
>  libavutil/common.h | 14 ++
>  1 file changed, 14 insertions(+)
> 

Ok.
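
The body of the patch is not quoted here, so the following is only a guess at the operation the name suggests: reducing an unsigned value modulo a power of two with a mask. The function name is hypothetical and this is not the libavutil code.

```c
/* Reduce a modulo 2^p by masking off the high bits.
 * Assumes p is smaller than the width of unsigned int. */
static unsigned mod_uintp2_sketch(unsigned a, unsigned p)
{
    return a & ((1U << p) - 1);
}
```

For example, mod_uintp2_sketch(0x1234, 8) keeps only the low byte, 0x34.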


Re: [libav-devel] [PATCH 1/3] intmath: add faster clz support

2017-02-23 Thread Luca Barbato
On 22/02/2017 18:53, Vittorio Giovara wrote:
> From: Ganesh Ajjanagadde 
> 
> ---
>  libavutil/intmath.h | 19 +++
>  1 file changed, 19 insertions(+)
> 

Sure
