Re: [FFmpeg-devel] Various build errors with armasm64 and armasm after update to FFmpeg 4.2

2019-10-02 Thread Martin Storsjö
On Oct 1, 2019, at 23:07, Lukas Fellechner wrote: This has worked very well for quite a long time. But after upgrading to FFmpeg 4.2, the build fails. A lot of changes and additions have been done for ARM/NEON 64-bit, and it looks like many of them are not compatible with

Re: [FFmpeg-devel] [PATCH] Fix gas-preprocessor to translate .rdata sections for armasm and armasm64

2019-10-02 Thread Martin Storsjö
On Oct 1, 2019, at 21:37, Lukas Fellechner wrote: Compiling FFmpeg with gas-preprocessor.pl and armasm or armasm64 fails since FFmpeg 4.2. New .rdata sections have been added in ARM NEON assembly code (e.g. libavutil/aarch64/asm.S). This fix allows gas-preprocessor to

Re: [FFmpeg-devel] [PATCH 2/2] libx264: Pass the reordered_opaque field through the encoder

2018-11-05 Thread Martin Storsjö
On Thu, 1 Nov 2018, Martin Storsjö wrote: On Thu, 1 Nov 2018, Derek Buitenhuis wrote: On 31/10/2018 21:41, Martin Storsjö wrote: Even though we do allow reconfiguration, it doesn't look like we support changing any parameters which would actually affect the delay, only RC method and targets

Re: [FFmpeg-devel] [PATCH 2/2] libx264: Pass the reordered_opaque field through the encoder

2018-11-01 Thread Martin Storsjö
On Thu, 1 Nov 2018, Derek Buitenhuis wrote: On 31/10/2018 21:41, Martin Storsjö wrote: Even though we do allow reconfiguration, it doesn't look like we support changing any parameters which would actually affect the delay, only RC method and targets (CRF, bitrate, etc). So given

Re: [FFmpeg-devel] [PATCH 2/2] libx264: Pass the reordered_opaque field through the encoder

2018-10-31 Thread Martin Storsjö
On Wed, 31 Oct 2018, Derek Buitenhuis wrote: On 30/10/2018 19:49, Martin Storsjö wrote: Hmm, that might make sense, but with a little twist. The max reordered frames for H.264 is known, but onto that you also get more delay due to frame threads and other details that this function within x264

Re: [FFmpeg-devel] [PATCH 2/2] libx264: Pass the reordered_opaque field through the encoder

2018-10-30 Thread Martin Storsjö
On Tue, 30 Oct 2018, Derek Buitenhuis wrote: On 29/10/2018 21:06, Martin Storsjö wrote: As I guess there can be old frames in flight, the only safe option is to enlarge, not to shrink it. But in case a realloc moves the array, the old pointers end up pretty useless. Just always allocate

Re: [FFmpeg-devel] [PATCH 2/2] libx264: Pass the reordered_opaque field through the encoder

2018-10-29 Thread Martin Storsjö
On Mon, 29 Oct 2018, Derek Buitenhuis wrote: On 25/10/2018 13:58, Martin Storsjö wrote: +x4->nb_reordered_opaque = x264_encoder_maximum_delayed_frames(x4->enc) + 1; Is it possible this changes when the encoder is reconfigured (e.g. to interlaced)? Good point. I'm sure it's po

Re: [FFmpeg-devel] [PATCH 1/2] libavutil: Undeprecate the AVFrame reordered_opaque field

2018-10-29 Thread Martin Storsjö
On Mon, 29 Oct 2018, Derek Buitenhuis wrote: On 29/10/2018 14:10, Martin Storsjö wrote: I don't understand why this is being used in favour of a proper pointer field? An integer field is just asking to be misused. Even the doxygen is really sketchy on it. It's essentially meant to be used

Re: [FFmpeg-devel] [PATCH 1/2] libavutil: Undeprecate the AVFrame reordered_opaque field

2018-10-29 Thread Martin Storsjö
On Mon, 29 Oct 2018, Derek Buitenhuis wrote: On 25/10/2018 13:58, Martin Storsjö wrote: This was marked as deprecated (but only in the doxygen, not with an actual deprecation attribute) in 81c623fae05 in 2011, but was undeprecated in ad1ee5fa7. --- libavutil/frame.h | 1 - libavutil

[FFmpeg-devel] [PATCH 2/2] flvdec: Export unknown metadata packets as opaque data

2018-10-28 Thread Martin Storsjö
--- Removed the option and made this behaviour the default. --- libavformat/flv.h | 1 + libavformat/flvdec.c | 18 ++ 2 files changed, 15 insertions(+), 4 deletions(-) diff --git a/libavformat/flv.h b/libavformat/flv.h index 3aabb3adc9..3571b90279 100644 ---

Re: [FFmpeg-devel] [PATCH 2/2] flvdec: Add an option for exporting unknown metadata packets as opaque data

2018-10-28 Thread Martin Storsjö
On Sun, 28 Oct 2018, Michael Niedermayer wrote: On Sat, Oct 27, 2018 at 09:22:18PM +0300, Martin Storsjö wrote: On Sat, 27 Oct 2018, Michael Niedermayer wrote: On Thu, Oct 25, 2018 at 03:59:17PM +0300, Martin Storsjö wrote: --- libavformat/flv.h | 1 + libavformat/flvdec.c | 21

Re: [FFmpeg-devel] [PATCH 2/2] flvdec: Add an option for exporting unknown metadata packets as opaque data

2018-10-27 Thread Martin Storsjö
On Sat, 27 Oct 2018, Michael Niedermayer wrote: On Thu, Oct 25, 2018 at 03:59:17PM +0300, Martin Storsjö wrote: --- libavformat/flv.h | 1 + libavformat/flvdec.c | 21 + 2 files changed, 18 insertions(+), 4 deletions(-) [...] @@ -1290,6 +1302,7 @@ static const

[FFmpeg-devel] [PATCH 1/2] libavutil: Undeprecate the AVFrame reordered_opaque field

2018-10-25 Thread Martin Storsjö
This was marked as deprecated (but only in the doxygen, not with an actual deprecation attribute) in 81c623fae05 in 2011, but was undeprecated in ad1ee5fa7. --- libavutil/frame.h | 1 - libavutil/version.h | 2 +- 2 files changed, 1 insertion(+), 2 deletions(-) diff --git a/libavutil/frame.h

[FFmpeg-devel] [PATCH 2/2] libx264: Pass the reordered_opaque field through the encoder

2018-10-25 Thread Martin Storsjö
libx264 does have a field for opaque data to pass along with frames through the encoder, but it is a pointer, while the libavcodec reordered_opaque field is an int64_t. Therefore, allocate an array within the libx264 wrapper, where reordered_opaque values in flight are stored, and pass a pointer
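The approach described in this patch (an array of in-flight reordered_opaque values, with a pointer to one slot passed through the encoder's pointer-typed opaque field) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code from the patch: OpaqueRing and the opaque_ring_* helpers are hypothetical names, and no libx264 calls are made. The array is sized once from the maximum number of delayed frames plus one and is never reallocated, so a pointer handed out earlier stays valid while its frame is still in flight.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for the wrapper's private state; this is not the
 * actual struct used in libavcodec/libx264.c. */
typedef struct OpaqueRing {
    int64_t *slots;    /* one slot per frame that can be in flight */
    int      nb_slots; /* maximum delayed frames + 1 */
    int      next;     /* index of the slot that is handed out next */
} OpaqueRing;

/* Allocate all slots up front. Because the array is never resized, it never
 * moves, and pointers to slots remain usable for frames still in flight. */
static int opaque_ring_init(OpaqueRing *r, int max_delayed_frames)
{
    assert(max_delayed_frames >= 0);
    r->nb_slots = max_delayed_frames + 1;
    r->next     = 0;
    r->slots    = calloc((size_t)r->nb_slots, sizeof(*r->slots));
    return r->slots ? 0 : -1;
}

/* Store an int64_t value and return a pointer to its slot; the caller would
 * attach this pointer to the input picture's pointer-typed opaque field. */
static void *opaque_ring_put(OpaqueRing *r, int64_t reordered_opaque)
{
    int64_t *slot = &r->slots[r->next];
    *slot   = reordered_opaque;
    r->next = (r->next + 1) % r->nb_slots;
    return slot;
}

/* Recover the value from the pointer that came back with an output picture. */
static int64_t opaque_ring_get(const void *opaque)
{
    return *(const int64_t *)opaque;
}

static void opaque_ring_free(OpaqueRing *r)
{
    free(r->slots);
    r->slots    = NULL;
    r->nb_slots = 0;
}
```

A slot is reused only after nb_slots further frames have been submitted, which is why the size has to cover every frame the encoder can hold back at once.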

[FFmpeg-devel] [PATCH 2/2] flvdec: Add an option for exporting unknown metadata packets as opaque data

2018-10-25 Thread Martin Storsjö
--- libavformat/flv.h | 1 + libavformat/flvdec.c | 21 + 2 files changed, 18 insertions(+), 4 deletions(-) diff --git a/libavformat/flv.h b/libavformat/flv.h index 3aabb3adc9..3571b90279 100644 --- a/libavformat/flv.h +++ b/libavformat/flv.h @@ -66,6 +66,7 @@ enum {

[FFmpeg-devel] [PATCH 1/2] flvdec: Rename FLV_STREAM_TYPE_DATA into FLV_STREAM_TYPE_SUBTITLE

2018-10-25 Thread Martin Storsjö
This is always treated as a subtitle at the moment anyway. --- libavformat/flv.h | 2 +- libavformat/flvdec.c | 8 2 files changed, 5 insertions(+), 5 deletions(-) diff --git a/libavformat/flv.h b/libavformat/flv.h index df5ce3d17f..3aabb3adc9 100644 --- a/libavformat/flv.h +++

Re: [FFmpeg-devel] [PATCH 3/3] arm: hevcdsp: Avoid using macro expansion counters

2018-03-31 Thread Martin Storsjö
On Sat, 31 Mar 2018, Hendrik Leppkes wrote: On Fri, Mar 30, 2018 at 9:14 PM, Martin Storsjö <mar...@martin.st> wrote: Clang supports the macro expansion counter (used for making unique labels within macro expansions), but not when targeting darwin. Convert uses of the counter into

[FFmpeg-devel] [PATCH 1/3] arm: swscale: Only compile the rgb2yuv asm if .dn aliases are supported

2018-03-30 Thread Martin Storsjö
Vanilla clang supports altmacro since clang 5.0, and thus doesn't require gas-preprocessor for building the arm assembly any longer. However, the built-in assembler doesn't support .dn directives. This readds checks that were removed in d7320ca3ed10f0d, when the last usage of .dn directives

[FFmpeg-devel] [PATCH 2/3] arm: hevcdsp_deblock: Add commas between macro arguments

2018-03-30 Thread Martin Storsjö
When targeting darwin, clang requires commas between arguments, while the no-comma form is allowed for other targets. Since Xcode 9.3, the bundled clang supports altmacro and doesn't require using gas-preprocessor any longer. --- libavcodec/arm/hevcdsp_deblock_neon.S | 8 1 file

[FFmpeg-devel] [PATCH 3/3] arm: hevcdsp: Avoid using macro expansion counters

2018-03-30 Thread Martin Storsjö
Clang supports the macro expansion counter (used for making unique labels within macro expansions), but not when targeting darwin. Convert uses of the counter into normal local labels, as used elsewhere. Since Xcode 9.3, the bundled clang supports altmacro and doesn't require using

Re: [FFmpeg-devel] [PATCHv3 4/4] libavcodec: v4l2: add support for v4l2 mem2mem codecs

2017-08-08 Thread Martin Storsjö
Hi Jorge, On Mon, 7 Aug 2017, Jorge Ramirez wrote: On 08/03/2017 01:53 AM, Mark Thompson wrote: +default: +return 0; +} + +SET_V4L_EXT_CTRL(value, qmin, avctx->qmin, "minimum video quantizer scale"); +SET_V4L_EXT_CTRL(value, qmax, avctx->qmax, "maximum video

[FFmpeg-devel] [PATCH 1/2] aarch64: vp9: Fix assembling with Xcode 6.2 and older

2017-06-20 Thread Martin Storsjö
From: Memphiz Properly use the b.eq/b.ge forms instead of the nonstandard forms (which both gas and newer clang accept though), and expand the register list that used a range (which the Xcode 6.2 clang, based on clang 3.5 svn, didn't support). This is cherrypicked from libav

[FFmpeg-devel] [PATCH 2/2] aarch64: vp9 16bpp: Fix assembling with Xcode 6.2 and older

2017-06-20 Thread Martin Storsjö
From: Memphiz Properly use the b.eq form instead of the nonstandard form (which both gas and newer clang accept though), and expand the register lists that used a range (which the Xcode 6.2 clang, based on clang 3.5 svn, didn't support). ---

[FFmpeg-devel] [PATCH 14/14] aarch64: vp9itxfm16: Do a simpler half/quarter idct16/idct32 when possible

2017-03-16 Thread Martin Storsjö
This work is sponsored by, and copyright, Google. This avoids loading and calculating coefficients that we know will be zero, and avoids filling the temp buffer with zeros in places where we know the second pass won't read. This gives a pretty substantial speedup for the smaller subpartitions.

[FFmpeg-devel] [PATCH 10/14] arm: vp9itxfm16: Make the larger core transforms standalone functions

2017-03-16 Thread Martin Storsjö
This work is sponsored by, and copyright, Google. This reduces the code size of libavcodec/arm/vp9itxfm_16bpp_neon.o from 17500 to 14516 bytes. This gives a small slowdown of a couple tens of cycles, up to around 150 cycles for the full case of the largest transform, but makes it more feasible

[FFmpeg-devel] [PATCH 11/14] aarch64: vp9itxfm16: Make the larger core transforms standalone functions

2017-03-16 Thread Martin Storsjö
This work is sponsored by, and copyright, Google. This reduces the code size of libavcodec/aarch64/vp9itxfm_16bpp_neon.o from 26288 to 21512 bytes. This gives a small slowdown of a couple of tens of cycles, but makes it more feasible to add more optimized versions of these transforms. Before:

[FFmpeg-devel] [PATCH 08/14] aarch64: vp9itxfm16: Avoid .irp when it doesn't save any lines

2017-03-16 Thread Martin Storsjö
This makes the code a bit more readable. --- libavcodec/aarch64/vp9itxfm_16bpp_neon.S | 24 1 file changed, 12 insertions(+), 12 deletions(-) diff --git a/libavcodec/aarch64/vp9itxfm_16bpp_neon.S b/libavcodec/aarch64/vp9itxfm_16bpp_neon.S index f80604f..86ea29e 100644

[FFmpeg-devel] [PATCH 03/14] arm/aarch64: vp9: Fix vertical alignment

2017-03-16 Thread Martin Storsjö
Align the second/third operands as they usually are. Due to the wildly varying sizes of the written out operands in aarch64 assembly, the column alignment is usually not as clear as in arm assembly. This is cherrypicked from libav commit 7995ebfad12002033c73feed422a1cfc62081e8f. ---

[FFmpeg-devel] [PATCH 09/14] aarch64: vp9itxfm16: Restructure the idct32 store macros

2017-03-16 Thread Martin Storsjö
This avoids concatenation, which can't be used if the whole macro is wrapped within another macro. --- libavcodec/aarch64/vp9itxfm_16bpp_neon.S | 90 1 file changed, 45 insertions(+), 45 deletions(-) diff --git a/libavcodec/aarch64/vp9itxfm_16bpp_neon.S

[FFmpeg-devel] [PATCH 12/14] aarch64: vp9itxfm16: Move the load_add_store macro out from the itxfm16 pass2 function

2017-03-16 Thread Martin Storsjö
This allows reusing the macro for a separate implementation of the pass2 function. --- libavcodec/aarch64/vp9itxfm_16bpp_neon.S | 98 1 file changed, 49 insertions(+), 49 deletions(-) diff --git a/libavcodec/aarch64/vp9itxfm_16bpp_neon.S

[FFmpeg-devel] [PATCH 06/14] arm: vp9itxfm16: Avoid reloading the idct32 coefficients

2017-03-16 Thread Martin Storsjö
Keep the idct32 coefficients in narrow form in q6-q7, and idct16 coefficients in lengthened 32 bit form in q0-q3. Avoid clobbering q0-q3 in the pass1 function, and squeeze the idct16 coefficients into q0-q1 in the pass2 function to avoid reloading them. The idct16 coefficients are clobbered and

[FFmpeg-devel] [PATCH 02/14] arm/aarch64: vp9itxfm: Skip loading the min_eob pointer when it won't be used

2017-03-16 Thread Martin Storsjö
In the half/quarter cases where we don't use the min_eob array, defer loading the pointer until we know it will be needed. This is cherrypicked from libav commit 3a0d5e206d24d41d87a25ba16a79b2ea04c39d4c. --- libavcodec/aarch64/vp9itxfm_neon.S | 3 ++- libavcodec/arm/vp9itxfm_neon.S | 4 ++--

[FFmpeg-devel] [PATCH 05/14] arm: vp9itxfm16: Fix vertical alignment

2017-03-16 Thread Martin Storsjö
--- libavcodec/arm/vp9itxfm_16bpp_neon.S | 20 ++-- 1 file changed, 10 insertions(+), 10 deletions(-) diff --git a/libavcodec/arm/vp9itxfm_16bpp_neon.S b/libavcodec/arm/vp9itxfm_16bpp_neon.S index a92f323..9c02ed9 100644 --- a/libavcodec/arm/vp9itxfm_16bpp_neon.S +++

[FFmpeg-devel] [PATCH 07/14] aarch64: vp9itxfm16: Fix a typo in a comment

2017-03-16 Thread Martin Storsjö
--- libavcodec/aarch64/vp9itxfm_16bpp_neon.S | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/libavcodec/aarch64/vp9itxfm_16bpp_neon.S b/libavcodec/aarch64/vp9itxfm_16bpp_neon.S index f53e94a..f80604f 100644 --- a/libavcodec/aarch64/vp9itxfm_16bpp_neon.S +++

[FFmpeg-devel] [PATCH 13/14] arm: vp9itxfm16: Do a simpler half/quarter idct16/idct32 when possible

2017-03-16 Thread Martin Storsjö
This work is sponsored by, and copyright, Google. This avoids loading and calculating coefficients that we know will be zero, and avoids filling the temp buffer with zeros in places where we know the second pass won't read. This gives a pretty substantial speedup for the smaller subpartitions.

[FFmpeg-devel] [PATCH 01/14] arm: vp9itxfm: Template the quarter/half idct32 function

2017-03-16 Thread Martin Storsjö
This reduces the number of lines and reduces the duplication. Also simplify the eob check for the half case. If we are in the half case, we know we at least will need to do the first three slices, we only need to check eob for the fourth one, so we can hardcode the value to check against instead

[FFmpeg-devel] [PATCH 04/14] arm: vp9itxfm16: Use the right lane size

2017-03-16 Thread Martin Storsjö
This makes the code slightly clearer, but doesn't make any functional difference. --- libavcodec/arm/vp9itxfm_16bpp_neon.S | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/libavcodec/arm/vp9itxfm_16bpp_neon.S b/libavcodec/arm/vp9itxfm_16bpp_neon.S index e6e9440..a92f323

[FFmpeg-devel] [PATCH 20/34] arm/aarch64: vp9lpf: Calculate !hev directly

2017-03-08 Thread Martin Storsjö
Previously we first calculated hev, and then negated it. Since we were able to schedule the negation in the middle of another calculation, we don't see any gain in all cases. Before: Cortex A7 A8 A9 A53 A53/AArch64 vp9_loop_filter_v_4_8_neon: 147.0 129.0

[FFmpeg-devel] [PATCH 12/34] aarch64: vp9itxfm: Use the right lane sizes in 8x8 for improved readability

2017-03-08 Thread Martin Storsjö
This is cherrypicked from libav commit 3dd7827258ddaa2e51085d0c677d6f3b1be3572f. --- libavcodec/aarch64/vp9itxfm_neon.S | 16 1 file changed, 8 insertions(+), 8 deletions(-) diff --git a/libavcodec/aarch64/vp9itxfm_neon.S b/libavcodec/aarch64/vp9itxfm_neon.S index

[FFmpeg-devel] [PATCH 09/34] arm: vp9itxfm: Share instructions for loading idct coeffs in the 8x8 function

2017-03-08 Thread Martin Storsjö
This is cherrypicked from libav commit 3933b86bb93aca47f29fbd493075b0f110c1e3f5. --- libavcodec/arm/vp9itxfm_neon.S | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/libavcodec/arm/vp9itxfm_neon.S b/libavcodec/arm/vp9itxfm_neon.S index 33a7af1..78fdae6 100644 ---

[FFmpeg-devel] [PATCH 11/34] aarch64: vp9itxfm: Use a single lane ld1 instead of ld1r where possible

2017-03-08 Thread Martin Storsjö
The ld1r is a leftover from the arm version, where this trick is beneficial on some cores. Use a single-lane load where we don't need the semantics of ld1r. This is cherrypicked from libav commit ed8d293306e12c9b79022d37d39f48825ce7f2fa. --- libavcodec/aarch64/vp9itxfm_neon.S | 16

[FFmpeg-devel] [PATCH 10/34] aarch64: vp9itxfm: Share instructions for loading idct coeffs in the 8x8 function

2017-03-08 Thread Martin Storsjö
This is cherrypicked from libav commit 4da4b2b87f08a1331650c7e36eb7d4029a160776. --- libavcodec/aarch64/vp9itxfm_neon.S | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/libavcodec/aarch64/vp9itxfm_neon.S b/libavcodec/aarch64/vp9itxfm_neon.S index 3eb999a..df178d2 100644 ---

[FFmpeg-devel] [PATCH 19/34] aarch64: vp9itxfm: Optimize 16x16 and 32x32 idct dc by unrolling

2017-03-08 Thread Martin Storsjö
This work is sponsored by, and copyright, Google. Before: Cortex A53 vp9_inv_dct_dct_16x16_sub1_add_neon: 235.3 vp9_inv_dct_dct_32x32_sub1_add_neon: 555.1 After: vp9_inv_dct_dct_16x16_sub1_add_neon: 180.2 vp9_inv_dct_dct_32x32_sub1_add_neon: 475.3 This is

[FFmpeg-devel] [PATCH 13/34] aarch64: vp9itxfm: Update a comment to refer to a register with a different name

2017-03-08 Thread Martin Storsjö
This is cherrypicked from libav commit 8476eb0d3ab1f7a52317b23346646389c08fb57a. --- libavcodec/aarch64/vp9itxfm_neon.S | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/libavcodec/aarch64/vp9itxfm_neon.S b/libavcodec/aarch64/vp9itxfm_neon.S index 3b34749..5219d6e 100644

[FFmpeg-devel] [PATCH 18/34] arm: vp9itxfm: Optimize 16x16 and 32x32 idct dc by unrolling

2017-03-08 Thread Martin Storsjö
This work is sponsored by, and copyright, Google. Before: Cortex A7 A8 A9 A53 vp9_inv_dct_dct_16x16_sub1_add_neon: 273.0 189.5 211.7 235.8 vp9_inv_dct_dct_32x32_sub1_add_neon: 752.0 459.2 862.2 553.9 After:

[FFmpeg-devel] [PATCH 05/34] arm: vp9itxfm: Move the load_add_store macro out from the itxfm16 pass2 function

2017-03-08 Thread Martin Storsjö
This allows reusing the macro for a separate implementation of the pass2 function. This is cherrypicked from libav commit 47b3c2c18d1897f3c753ba0cec4b2d7aa24526af. --- libavcodec/arm/vp9itxfm_neon.S | 72 +- 1 file changed, 36 insertions(+), 36

[FFmpeg-devel] [PATCH 21/34] arm: vp9lpf: Use orrs instead of orr+cmp

2017-03-08 Thread Martin Storsjö
This is cherrypicked from libav commit 435cd7bc99671bf561193421a50ac6e9d63c4266. --- libavcodec/arm/vp9lpf_neon.S | 12 1 file changed, 4 insertions(+), 8 deletions(-) diff --git a/libavcodec/arm/vp9lpf_neon.S b/libavcodec/arm/vp9lpf_neon.S index 2761956..3d289e5 100644 ---

[FFmpeg-devel] [PATCH 06/34] aarch64: vp9itxfm: Move the load_add_store macro out from the itxfm16 pass2 function

2017-03-08 Thread Martin Storsjö
This allows reusing the macro for a separate implementation of the pass2 function. This is cherrypicked from libav commit 79d332ebbde8c0a3e9da094dcfd10abd33ba7378. --- libavcodec/aarch64/vp9itxfm_neon.S | 90 +++--- 1 file changed, 45 insertions(+), 45

[FFmpeg-devel] [PATCH 03/34] arm: vp9itxfm: Make the larger core transforms standalone functions

2017-03-08 Thread Martin Storsjö
This work is sponsored by, and copyright, Google. This reduces the code size of libavcodec/arm/vp9itxfm_neon.o from 15324 to 12388 bytes. This gives a small slowdown of a couple tens of cycles, up to around 150 cycles for the full case of the largest transform, but makes it more feasible to add

[FFmpeg-devel] [PATCH 22/34] arm: vp9lpf: Interleave the start of flat8in into the calculation above

2017-03-08 Thread Martin Storsjö
This adds lots of extra .ifs, but speeds it up by a couple cycles, by avoiding stalls. This is cherrypicked from libav commit e18c39005ad1dbb178b336f691da1de91afd434e. --- libavcodec/arm/vp9lpf_neon.S | 8 ++-- 1 file changed, 6 insertions(+), 2 deletions(-) diff --git

[FFmpeg-devel] [PATCH 34/34] aarch64: vp9itxfm: Reorder iadst16 coeffs

2017-03-08 Thread Martin Storsjö
This matches the order they are in the 16 bpp version. There they are in this order, to make sure we access them in the same order they are declared, easing loading only half of the coefficients at a time. This makes the 8 bpp version match the 16 bpp version better. This is cherrypicked from

[FFmpeg-devel] [PATCH 33/34] arm: vp9itxfm: Reorder iadst16 coeffs

2017-03-08 Thread Martin Storsjö
This matches the order they are in the 16 bpp version. There they are in this order, to make sure we access them in the same order they are declared, easing loading only half of the coefficients at a time. This makes the 8 bpp version match the 16 bpp version better. This is cherrypicked from

[FFmpeg-devel] [PATCH 28/34] arm: vp9lpf: Implement the mix2_44 function with one single filter pass

2017-03-08 Thread Martin Storsjö
For this case, with 8 inputs but only changing 4 of them, we can fit all 16 input pixels into a q register, and still have enough temporary registers for doing the loop filter. The wd=8 filters would require too many temporary registers for processing all 16 pixels at once though. Before:

[FFmpeg-devel] [PATCH 30/34] aarch64: vp9itxfm: Avoid reloading the idct32 coefficients

2017-03-08 Thread Martin Storsjö
The idct32x32 function actually pushed d8-d15 onto the stack even though it didn't clobber them; there are plenty of registers that can be used to allow keeping all the idct coefficients in registers without having to reload different subsets of them at different stages in the transform. After

[FFmpeg-devel] [PATCH 27/34] aarch64: vp9lpf: Use dup+rev16+uzp1 instead of dup+lsr+dup+trn1

2017-03-08 Thread Martin Storsjö
This is one cycle faster in total, and three instructions fewer. Before: vp9_loop_filter_mix2_v_44_16_neon: 123.2 After: vp9_loop_filter_mix2_v_44_16_neon: 122.2 This is cherrypicked from libav commit 3bf9c48320f25f3d5557485b0202f22ae60748b0. --- libavcodec/aarch64/vp9lpf_neon.S | 21

[FFmpeg-devel] [PATCH 31/34] arm: vp9itxfm: Reorder the idct coefficients for better pairing

2017-03-08 Thread Martin Storsjö
All elements are used pairwise, except for the first one. Previously, the 16th element was unused. Move the unused element to the second slot, to make the later element pairs not split across registers. This simplifies loading only parts of the coefficients, reducing the difference to the 16 bpp

[FFmpeg-devel] [PATCH 32/34] aarch64: vp9itxfm: Reorder the idct coefficients for better pairing

2017-03-08 Thread Martin Storsjö
All elements are used pairwise, except for the first one. Previously, the 16th element was unused. Move the unused element to the second slot, to make the later element pairs not split across registers. This simplifies loading only parts of the coefficients, reducing the difference to the 16 bpp

[FFmpeg-devel] [PATCH 29/34] arm: vp9itxfm: Avoid reloading the idct32 coefficients

2017-03-08 Thread Martin Storsjö
The idct32x32 function actually pushed q4-q7 onto the stack even though it didn't clobber them; there are plenty of registers that can be used to allow keeping all the idct coefficients in registers without having to reload different subsets of them at different stages in the transform. Since the

[FFmpeg-devel] [PATCH 26/34] arm/aarch64: vp9lpf: Keep the comparison to E within 8 bit

2017-03-08 Thread Martin Storsjö
The theoretical maximum value of E is 193, so we can just saturate the addition to 255. Before: Cortex A7 A8 A9 A53 A53/AArch64 vp9_loop_filter_v_4_8_neon: 143.0 127.7 114.8 88.0 87.7 vp9_loop_filter_v_8_8_neon: 241.0 197.2 173.7
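The reasoning behind keeping the comparison in 8 bits can be checked with a small scalar model. The sketch below assumes the edge test has the usual VP9 form 2*|p0-q0| + |p1-q1|/2 <= E (the snippet above does not spell out the formula), and the function names are made up for illustration. Since E never exceeds 193, a sum clamped to 255 still compares as "greater than E" whenever the true sum would have overflowed 8 bits.

```c
#include <assert.h>
#include <stdint.h>

/* 8-bit unsigned saturating add, modelling one lane of a NEON saturating add. */
static uint8_t sat_add_u8(uint8_t a, uint8_t b)
{
    unsigned sum = (unsigned)a + b;
    return sum > 255 ? 255 : (uint8_t)sum;
}

/* Reference edge test with wide intermediates.
 * d0 = |p0 - q0|, d1 = |p1 - q1| (assumed form of the comparison). */
static int edge_ok_wide(uint8_t d0, uint8_t d1, uint8_t E)
{
    return 2 * (int)d0 + (d1 >> 1) <= E;
}

/* Same test with every intermediate kept in 8 bits via saturating adds. */
static int edge_ok_u8(uint8_t d0, uint8_t d1, uint8_t E)
{
    uint8_t twice = sat_add_u8(d0, d0);
    uint8_t sum   = sat_add_u8(twice, (uint8_t)(d1 >> 1));
    return sum <= E;
}

/* Exhaustively compare both variants for every input and every E up to the
 * theoretical maximum of 193; returns the number of disagreements. */
static int count_mismatches(void)
{
    int mismatches = 0;
    for (int E = 0; E <= 193; E++)
        for (int d0 = 0; d0 < 256; d0++)
            for (int d1 = 0; d1 < 256; d1++)
                mismatches += edge_ok_wide((uint8_t)d0, (uint8_t)d1, (uint8_t)E) !=
                              edge_ok_u8((uint8_t)d0, (uint8_t)d1, (uint8_t)E);
    return mismatches;
}
```

The two variants could only disagree if E were allowed to reach 255, because a saturated sum of 255 would then pass the test even though the true sum is larger.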

[FFmpeg-devel] [PATCH 25/34] aarch64: Add parentheses around the offset parameter in movrel

2017-03-08 Thread Martin Storsjö
This fixes building with clang for linux with PIC enabled. This is cherrypicked from libav commit 8847eeaa14189885038140fb2b8a7adc7100. --- libavutil/aarch64/asm.S | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/libavutil/aarch64/asm.S b/libavutil/aarch64/asm.S index

[FFmpeg-devel] [PATCH 24/34] aarch64: vp9lpf: Fix broken indentation/vertical alignment

2017-03-08 Thread Martin Storsjö
This is cherrypicked from libav commit 07b5136c481d394992c7e951967df0cfbb346c0b. --- libavcodec/aarch64/vp9lpf_neon.S | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/libavcodec/aarch64/vp9lpf_neon.S b/libavcodec/aarch64/vp9lpf_neon.S index cd3e26c..ebfd9be 100644 ---

[FFmpeg-devel] [PATCH 23/34] aarch64: vp9lpf: Interleave the start of flat8in into the calculation above

2017-03-08 Thread Martin Storsjö
This adds lots of extra .ifs, but speeds it up by a couple cycles, by avoiding stalls. This is cherrypicked from libav commit b0806088d3b27044145b20421da8d39089ae0c6a. --- libavcodec/aarch64/vp9lpf_neon.S | 14 +++--- 1 file changed, 11 insertions(+), 3 deletions(-) diff --git

[FFmpeg-devel] [PATCH 17/34] aarch64: vp9mc: Calculate less unused data in the 4 pixel wide horizontal filter

2017-03-08 Thread Martin Storsjö
No measured speedup on a Cortex A53, but other cores might benefit. This is cherrypicked from libav commit 388e0d2515bc6bbc9d0c9af1d230bd16cf945fe7. --- libavcodec/aarch64/vp9mc_neon.S | 15 +-- 1 file changed, 13 insertions(+), 2 deletions(-) diff --git

[FFmpeg-devel] [PATCH 16/34] arm: vp9mc: Calculate less unused data in the 4 pixel wide horizontal filter

2017-03-08 Thread Martin Storsjö
Before:                       Cortex A7     A8     A9    A53
vp9_put_8tap_smooth_4h_neon:      378.1  273.2  340.7  229.5
After:
vp9_put_8tap_smooth_4h_neon:      352.1  222.2  290.5  229.5
This is cherrypicked from libav commit fea92a4b57d1c328b1de226a5f213a629ee63754. ---

[FFmpeg-devel] [PATCH 15/34] aarch64: vp9mc: Simplify the extmla macro parameters

2017-03-08 Thread Martin Storsjö
Fold the field lengths into the macro. This makes the macro invocations much more readable, when the lines are shorter. This also makes it easier to use only half the registers within the macro. This is cherrypicked from libav commit 5e0c2158fbc774f87d3ce4b7b950ba4d42c4a7b8. ---

[FFmpeg-devel] [PATCH 08/34] aarch64: vp9itxfm: Do separate functions for half/quarter idct16 and idct32

2017-03-08 Thread Martin Storsjö
This work is sponsored by, and copyright, Google. This avoids loading and calculating coefficients that we know will be zero, and avoids filling the temp buffer with zeros in places where we know the second pass won't read. This gives a pretty substantial speedup for the smaller subpartitions.

[FFmpeg-devel] [PATCH 07/34] arm: vp9itxfm: Do a simpler half/quarter idct16/idct32 when possible

2017-03-08 Thread Martin Storsjö
This work is sponsored by, and copyright, Google. This avoids loading and calculating coefficients that we know will be zero, and avoids filling the temp buffer with zeros in places where we know the second pass won't read. This gives a pretty substantial speedup for the smaller subpartitions.

[FFmpeg-devel] [PATCH 14/34] aarch64: vp9itxfm: Fix incorrect vertical alignment

2017-03-08 Thread Martin Storsjö
This is cherrypicked from libav commit 0c0b87f12d48d4e7f0d3d13f9345e828a3a5ea32. --- libavcodec/aarch64/vp9itxfm_neon.S | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/libavcodec/aarch64/vp9itxfm_neon.S b/libavcodec/aarch64/vp9itxfm_neon.S index 5219d6e..6bb097b 100644

[FFmpeg-devel] [PATCH 04/34] aarch64: vp9itxfm: Make the larger core transforms standalone functions

2017-03-08 Thread Martin Storsjö
This work is sponsored by, and copyright, Google. This reduces the code size of libavcodec/aarch64/vp9itxfm_neon.o from 19496 to 14740 bytes. This gives a small slowdown of a couple of tens of cycles, but makes it more feasible to add more optimized versions of these transforms. Before:

[FFmpeg-devel] [PATCH 02/34] aarch64: vp9itxfm: Restructure the idct32 store macros

2017-03-08 Thread Martin Storsjö
This avoids concatenation, which can't be used if the whole macro is wrapped within another macro. This is also arguably more readable. This is cherrypicked from libav commit 58d87e0f49bcbbc6f426328f53b657bae7430cd2. --- libavcodec/aarch64/vp9itxfm_neon.S | 80

[FFmpeg-devel] [PATCH 01/34] arm: vp9itxfm: Avoid .irp when it doesn't save any lines

2017-03-08 Thread Martin Storsjö
This makes it more readable. This is cherrypicked from libav commit 3bc5b28d5a191864c54bba60646933a63da31656. --- libavcodec/arm/vp9itxfm_neon.S | 24 1 file changed, 12 insertions(+), 12 deletions(-) diff --git a/libavcodec/arm/vp9itxfm_neon.S

Re: [FFmpeg-devel] [PATCH 1/8] arm: vp9dsp: Restructure the bpp checks

2017-01-24 Thread Martin Storsjö
On Thu, 19 Jan 2017, Michael Niedermayer wrote: On Wed, Jan 18, 2017 at 11:45:08PM +0200, Martin Storsjö wrote: This work is sponsored by, and copyright, Google. This is more in line with how it will be extended for more bitdepths. --- libavcodec/arm/vp9dsp_init_arm.c | 24

[FFmpeg-devel] [PATCH 4/8] arm: Add NEON optimizations for 10 and 12 bit vp9 loop filter

2017-01-18 Thread Martin Storsjö
This work is sponsored by, and copyright, Google. This is pretty much similar to the 8 bpp version, but in some senses simpler. All input pixels are 16 bits, and all intermediates also fit in 16 bits, so there's no lengthening/narrowing in the filter at all. For the full 16 pixel wide filter, we

[FFmpeg-devel] [PATCH 1/8] arm: vp9dsp: Restructure the bpp checks

2017-01-18 Thread Martin Storsjö
This work is sponsored by, and copyright, Google. This is more in line with how it will be extended for more bitdepths. --- libavcodec/arm/vp9dsp_init_arm.c | 24 +--- 1 file changed, 9 insertions(+), 15 deletions(-) diff --git a/libavcodec/arm/vp9dsp_init_arm.c

[FFmpeg-devel] [PATCH 6/8] aarch64: Add NEON optimizations for 10 and 12 bit vp9 MC

2017-01-18 Thread Martin Storsjö
This work is sponsored by, and copyright, Google. This has mostly got the same differences to the 8 bit version as in the arm version. For the horizontal filters, we do 16 pixels in parallel as well. For the 8 pixel wide vertical filters, we can accumulate 4 rows before storing, just as in the 8

[FFmpeg-devel] [PATCH 7/8] aarch64: Add NEON optimizations for 10 and 12 bit vp9 itxfm

2017-01-18 Thread Martin Storsjö
This work is sponsored by, and copyright, Google. Compared to the arm version, on aarch64 we can keep the full 8x8 transform in registers, and for 16x16 and 32x32, we can process it in slices of 4 pixels instead of 2. Examples of runtimes vs the 32 bit version, on a Cortex A53:

[FFmpeg-devel] [PATCH 3/8] arm: Add NEON optimizations for 10 and 12 bit vp9 itxfm

2017-01-18 Thread Martin Storsjö
This work is sponsored by, and copyright, Google. This is structured similarly to the 8 bit version. In the 8 bit version, the coefficients are 16 bits, and intermediates are 32 bits. Here, the coefficients are 32 bit. For the 4x4 transforms for 10 bit content, the intermediates also fit in 32

[FFmpeg-devel] [PATCH 8/8] aarch64: Add NEON optimizations for 10 and 12 bit vp9 loop filter

2017-01-18 Thread Martin Storsjö
This work is sponsored by, and copyright, Google. This is similar to the arm version, but due to the larger registers on aarch64, we can do 8 pixels at a time for all filter sizes. Examples of runtimes vs the 32 bit version, on a Cortex A53: ARM

[FFmpeg-devel] [PATCH 2/8] arm: Add NEON optimizations for 10 and 12 bit vp9 MC

2017-01-18 Thread Martin Storsjö
This work is sponsored by, and copyright, Google. The plain pixel put/copy functions are used from the 8 bit version, for the double size (e.g. put16 uses ff_vp9_copy32_neon), and a new copy128 is added. Compared with the 8 bit version, the filters can no longer use the trick to accumulate in 16

[FFmpeg-devel] [PATCH 5/8] aarch64: vp9dsp: Restructure the bpp checks

2017-01-18 Thread Martin Storsjö
This work is sponsored by, and copyright, Google. This is more in line with how it will be extended for more bitdepths. --- libavcodec/aarch64/vp9dsp_init_aarch64.c | 24 +--- 1 file changed, 9 insertions(+), 15 deletions(-) diff --git

[FFmpeg-devel] [PATCH 08/13] arm: vp9itxfm: Only reload the idct coeffs for the iadst_idct combination

2017-01-09 Thread Martin Storsjö
This avoids reloading them if they haven't been clobbered, if the first pass also was idct. This is similar to what was done in the aarch64 version. This is cherrypicked from libav commit 3c87039a404c5659ae9bf7454a04e186532eb40b. --- libavcodec/arm/vp9itxfm_neon.S | 2 +- 1 file changed, 1

[FFmpeg-devel] [PATCH 07/13] aarch64: vp9itxfm: Don't repeatedly set x9 when nothing overwrites it

2017-01-09 Thread Martin Storsjö
This is cherrypicked from libav commit 2f99117f6ff24ce5be2abb9e014cb8b86c2aa0e0. --- libavcodec/aarch64/vp9itxfm_neon.S | 26 +++--- 1 file changed, 15 insertions(+), 11 deletions(-) diff --git a/libavcodec/aarch64/vp9itxfm_neon.S b/libavcodec/aarch64/vp9itxfm_neon.S index

[FFmpeg-devel] [PATCH 12/13] aarch64: vp9dsp: Fix vertical alignment in the init file

2017-01-09 Thread Martin Storsjö
This is cherrypicked from libav commit 65074791e8f8397600aacc9801efdd1eb6e3. --- libavcodec/aarch64/vp9dsp_init_aarch64.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/libavcodec/aarch64/vp9dsp_init_aarch64.c b/libavcodec/aarch64/vp9dsp_init_aarch64.c index

[FFmpeg-devel] [PATCH 03/13] arm: vp9itxfm: Simplify the stack alignment code

2017-01-09 Thread Martin Storsjö
From: Janne Grunau This is one instruction less for thumb, and only has 1/2 arm/thumb specific instructions. This is cherrypicked from libav commit e5b0fc170f85b00f7dd0ac514918fb5c95253d39. --- libavcodec/arm/vp9itxfm_neon.S | 28 1

[FFmpeg-devel] [PATCH 02/13] aarch64: vp9: loop filter: replace 'orr; cbn?z' with 'adds; b.{eq,ne};

2017-01-09 Thread Martin Storsjö
From: Janne Grunau The latter is 1 cycle faster on a Cortex-A53, and since the operands are bytewise (or larger) bitmasks (impossible to overflow to zero), both are equivalent. This is cherrypicked from libav commit e7ae8f7a715843a5089d18e033afb3ee19ab3057. ---
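The equivalence claim can be checked exhaustively for 64 bit bytewise masks. The following is a sketch in C rather than the assembly from the patch; `byte_mask` and `orr_and_adds_agree` are hypothetical names, and the check assumes each byte of an operand is either 0x00 or 0xff:

```c
#include <assert.h>
#include <stdint.h>

/* Expand an 8 bit selector into a 64 bit bytewise mask:
 * if bit i of `sel` is set, byte i of the result is 0xff. */
uint64_t byte_mask(unsigned sel)
{
    assert(sel < 256);
    uint64_t m = 0;
    for (int i = 0; i < 8; i++)
        if (sel & (1u << i))
            m |= (uint64_t)0xff << (8 * i);
    return m;
}

/* Returns 1 if "(a | b) == 0" and "(a + b) == 0" give the same answer
 * for every pair of bytewise masks, 0 otherwise. */
int orr_and_adds_agree(void)
{
    for (unsigned i = 0; i < 256; i++) {
        for (unsigned j = 0; j < 256; j++) {
            uint64_t a = byte_mask(i), b = byte_mask(j);
            int orr_zero  = (a | b) == 0;
            int adds_zero = (uint64_t)(a + b) == 0;
            if (orr_zero != adds_zero)
                return 0;
        }
    }
    return 1;
}
```

The restriction to masks matters: for arbitrary values such as a = 1 and b = UINT64_MAX, the sum wraps to zero while the OR does not.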

[FFmpeg-devel] [PATCH 04/13] aarch64: vp9itxfm: Use w3 instead of x3 for the int eob parameter

2017-01-09 Thread Martin Storsjö
The clobbering tests in checkasm are only invoked when testing correctness, so this bug didn't show up when benchmarking the dc-only version. This is cherrypicked from libav commit 4d960a11855f4212eb3a4e470ce890db7f01df29. --- libavcodec/aarch64/vp9itxfm_neon.S | 8 1 file changed, 4

[FFmpeg-devel] [PATCH 11/13] arm: vp9mc: Fix vertical alignment of operands

2017-01-09 Thread Martin Storsjö
This is cherrypicked from libav commit c536e5e8698110c139b1c17938998a5547550aa3. --- libavcodec/arm/vp9mc_neon.S | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/libavcodec/arm/vp9mc_neon.S b/libavcodec/arm/vp9mc_neon.S index 5fe3024..83235ff 100644 ---

[FFmpeg-devel] [PATCH 05/13] arm/aarch64: vp9itxfm: Fix indentation of macro arguments

2017-01-09 Thread Martin Storsjö
This is cherrypicked from libav commit 721bc37522c5c1d6a8c3cea5e9c3fcde8d256c05. --- libavcodec/aarch64/vp9itxfm_neon.S | 16 libavcodec/arm/vp9itxfm_neon.S | 8 2 files changed, 12 insertions(+), 12 deletions(-) diff --git a/libavcodec/aarch64/vp9itxfm_neon.S

[FFmpeg-devel] [PATCH 13/13] aarch64: vp9mc: Fix a comment to refer to a register with the right name

2017-01-09 Thread Martin Storsjö
This is cherrypicked from libav commit 85ad5ea72ce3983947a3b07e4b35c66cb16dfaba. --- libavcodec/aarch64/vp9mc_neon.S | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/libavcodec/aarch64/vp9mc_neon.S b/libavcodec/aarch64/vp9mc_neon.S index 69dad6d..80d1d23 100644 ---

[FFmpeg-devel] [PATCH 06/13] arm: vp9itxfm: Rename a macro parameter to fit better

2017-01-09 Thread Martin Storsjö
Since the same parameter is used for both input and output, the name inout is more fitting. This matches the naming used below in the dmbutterfly macro. This is cherrypicked from libav commit 79566ec8c77969d5f9be533de04b1349834cca62. --- libavcodec/arm/vp9itxfm_neon.S | 14 +++--- 1

[FFmpeg-devel] [PATCH 09/13] arm: vp9itxfm: Skip empty slices in the first pass of idct_idct 16x16 and 32x32

2017-01-09 Thread Martin Storsjö
This work is sponsored by, and copyright, Google. Previously all subpartitions except the eob=1 (DC) case ran with the same runtime:

                                       Cortex A7       A8       A9      A53
vp9_inv_dct_dct_16x16_sub16_add_neon:     3188.1   2435.4   2499.0   1969.0

[FFmpeg-devel] [PATCH 01/13] aarch64: vp9: use alternative returns in the core loop filter function

2017-01-09 Thread Martin Storsjö
From: Janne Grunau Since aarch64 has enough free general purpose registers, use them to branch to the appropriate storage code. 1-2 cycles faster for the functions using loop_filter 8/16, ... on a cortex-a53. Mixed results (up to 2 cycles faster/slower) on a cortex-a57.

[FFmpeg-devel] [PATCH 10/13] aarch64: vp9itxfm: Skip empty slices in the first pass of idct_idct 16x16 and 32x32

2017-01-09 Thread Martin Storsjö
This work is sponsored by, and copyright, Google. Previously all subpartitions except the eob=1 (DC) case ran with the same runtime:

vp9_inv_dct_dct_16x16_sub16_add_neon:   1373.2
vp9_inv_dct_dct_32x32_sub32_add_neon:   8089.0

By skipping individual 8x16 or 8x32 pixel slices in the first pass,

Re: [FFmpeg-devel] [PATCH 1/9] vp9dsp: Deduplicate the subpel filters

2016-11-14 Thread Martin Storsjö
On Mon, 14 Nov 2016, Ronald S. Bultje wrote: Hi, On Mon, Nov 14, 2016 at 5:32 AM, Martin Storsjö <mar...@martin.st> wrote: Make them aligned, to allow efficient access to them from simd. This is an adapted cherry-pick from libav commit a4cfcddcb0f76e837d5abc06840c2b26c0

[FFmpeg-devel] [PATCH 1/9] vp9dsp: Deduplicate the subpel filters

2016-11-14 Thread Martin Storsjö
Make them aligned, to allow efficient access to them from simd. This is an adapted cherry-pick from libav commit a4cfcddcb0f76e837d5abc06840c2b26c0e8aefc. --- libavcodec/vp9dsp.c | 56 +++ libavcodec/vp9dsp.h | 3 +++
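As an illustration of what an aligned table gives SIMD code, here is a minimal sketch using C11 `alignas` as a stand-in for whatever alignment macro the patch itself uses. The table name, its size and its values are assumptions for the example, not the contents of vp9dsp.c:

```c
#include <assert.h>
#include <stdalign.h>
#include <stdint.h>

/* Hypothetical table: two 8-tap filters of 16 bit coefficients.
 * The values are placeholders chosen to sum to 128. */
alignas(16) const int16_t example_filters[2][8] = {
    { 0, 0,  0, 128, 0,  0, 0, 0 },
    { 0, 1, -5, 126, 8, -3, 1, 0 },
};

/* Each row is exactly 16 bytes, so an aligned table start means every
 * row also starts on a 16 byte boundary. */
static_assert(sizeof(example_filters[0]) == 16, "one row is 16 bytes");

/* Returns 1 if `p` is 16 byte aligned, 0 otherwise. */
int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15) == 0;
}
```

With this layout, one row of eight coefficients can be fetched with a single aligned 128 bit vector load.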

[FFmpeg-devel] [PATCH 2/9] arm: Clear the gp register alias at the end of functions

2016-11-14 Thread Martin Storsjö
We reset .Lpic_gp to zero at the start of each function, which means that the logic within movrelx for clearing gp when necessary will be missed. This fixes using movrelx in different functions with a different helper register. This is cherry-picked from libav commit

[FFmpeg-devel] [PATCH 9/9] aarch64: vp9: Implement NEON loop filters

2016-11-14 Thread Martin Storsjö
This work is sponsored by, and copyright, Google. These are ported from the ARM version; thanks to the larger number of registers available, we can do the loop filters with 16 pixels at a time. The implementation is fully templated, with a single macro which can generate versions for both 8 and

[FFmpeg-devel] [PATCH 4/9] arm: vp9: Add NEON itxfm routines

2016-11-14 Thread Martin Storsjö
This work is sponsored by, and copyright, Google. For the transforms up to 8x8, we can fit all the data (including temporaries) in registers and just do a straightforward transform of all the data. For 16x16, we do a transform of 4x16 pixels in 4 slices, using a temporary buffer. For 32x32, we

[FFmpeg-devel] [PATCH 3/9] arm: vp9: Add NEON optimizations of VP9 MC functions

2016-11-14 Thread Martin Storsjö
This work is sponsored by, and copyright, Google. The filter coefficients are signed values, where the product of the multiplication with one individual filter coefficient doesn't overflow a 16 bit signed value (the largest filter coefficient is 127). But when the products are accumulated, the
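The range argument can be made concrete with a small C sketch. It checks that the largest 8 bit pixel times the largest coefficient fits in a signed 16 bit value, and then tracks whether a running sum of products stays in that range. The tap values and the all-255 pixel row are illustrative assumptions, and the function names are hypothetical:

```c
#include <assert.h>
#include <stdint.h>

/* Largest 8 bit pixel times the largest coefficient named in the text. */
static_assert(255 * 127 <= INT16_MAX, "a single product fits in int16_t");

/* Illustrative 8-tap filter (assumed values, summing to 128) and a row
 * of maximum-valued 8 bit pixels. */
const int16_t example_taps[8] = { 0, 1, -5, 126, 8, -3, 1, 0 };
const uint8_t white_row[8]    = { 255, 255, 255, 255, 255, 255, 255, 255 };

/* Returns 1 if the left-to-right running sum of px[i] * taps[i] ever
 * leaves the int16_t range, 0 otherwise. The sum is kept in 32 bits,
 * so the check itself cannot overflow. */
int running_sum_leaves_int16(const uint8_t px[8], const int16_t taps[8])
{
    int32_t sum = 0;
    for (int i = 0; i < 8; i++) {
        sum += (int32_t)px[i] * taps[i];
        if (sum > INT16_MAX || sum < INT16_MIN)
            return 1;
    }
    return 0;
}

/* Final filter sum, computed in 32 bit precision. */
int32_t filter_sum(const uint8_t px[8], const int16_t taps[8])
{
    int32_t sum = 0;
    for (int i = 0; i < 8; i++)
        sum += (int32_t)px[i] * taps[i];
    return sum;
}
```

For these example inputs the final sum (255 * 128 = 32640) fits in 16 bits, yet the running sum passes through 33150 on the way, which is why the width of the accumulator and the order of accumulation need care.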

[FFmpeg-devel] [PATCH 5/9] arm: vp9: Add NEON loop filters

2016-11-14 Thread Martin Storsjö
This work is sponsored by, and copyright, Google. The implementation tries to have smart handling of cases where no pixels need the full filtering for the 8/16 width filters, skipping both calculation and writeback of the unmodified pixels in those cases. The actual effect of this is hard to test
