Re: [FFmpeg-devel] [PATCH]v6 Opus Pyramid Vector Quantization Search in x86 SIMD asm

2017-08-06 Thread Henrik Gramner
On Sat, Aug 5, 2017 at 12:58 AM, Ivan Kalvachev  wrote:
> 8 packed, 8 scalar.
>
> Unless I miss something (and as I've said before,
> I'm not confident enough to mess with that code.)
>
> (AVX does extend to 32 variants, but they are not
> SSE compatible, so no need to emulate them.)

Oh, right. I quickly glanced at the docs and saw 32 pseudo-ops for
each instruction for a total of 128 when adding pd, ps, sd, ss, but
the fact that only the first 8 is relevant here reduces it to 32 which
is a lot more manageable.

> movaps m1, [WRT_PIC_BASE + const_2 + r2 ]
>
> Looks better. (Also not tested. Will do, later.)

I intentionally used the WRT define at the end because that's most
similar to the built in wrt syntax used when accessing symbols through
the PLT or GOT, e.g.

mov eax, [external_symbol wrt ..got]

> Yeh $$ is the start of the current section, and that's is going to be
> ".text"  not "rodata".

Obviously, yes. You need a reference that results in a compile-time
constant PC-offset (which .rodata isn't) to create PC-relative
relocation records to external symbols.
___
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
http://ffmpeg.org/mailman/listinfo/ffmpeg-devel


Re: [FFmpeg-devel] [PATCH]v6 Opus Pyramid Vector Quantization Search in x86 SIMD asm

2017-08-05 Thread Ivan Kalvachev
On 8/5/17, Hendrik Leppkes  wrote:
> On Sat, Aug 5, 2017 at 12:21 PM, Ivan Kalvachev 
> wrote:
>> On 8/5/17, Rostislav Pehlivanov  wrote:
>>> On 4 August 2017 at 23:58, Ivan Kalvachev  wrote:
>>>

 >> I had it mixed before, but I changed it to be consistent.
 >> Some new avx instruction are not overloaded or
 >> available without their "v-" prefix.
 >> e.g. x86util uses vpbroadcastw in SPLATW.
 >
 > Every instruction that has both a legacy encoding and a VEX-encoding
 > will be automatically converted to VEX in AVX functions when written
 > without the v-prefix. There is no legacy encoding for newer
 > instructions so there's no reason for x86inc to overload those.

 That's great, but it doesn't address the original problem.
 Do you insist on removing the "v-" prefix from AVX only instructions?


>>> I insist.
>>
>> If you insist, then you should have a good technical reason for it.
>> Something that is not based on looks, cosmetics or "it has always been
>> like this".
>>
>> For example, slow compilation because of unnecessary macro expansion is
>> a technical reason to keep the "v-" prefix.
>>
>
> Consistency with the typical style of our codebase is also a reason.

No, it is not. This fall in category of doing things wrong,
because we have always done them wrong.
Also you brought up the matter of compilation speed,
now you don't care about it? ;)

Don't worry, I will do the requested change,
only because in future it might help with
avx1 & ymm integer test.

Please answer my other question(s).
___
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
http://ffmpeg.org/mailman/listinfo/ffmpeg-devel


Re: [FFmpeg-devel] [PATCH]v6 Opus Pyramid Vector Quantization Search in x86 SIMD asm

2017-08-05 Thread Hendrik Leppkes
On Sat, Aug 5, 2017 at 12:21 PM, Ivan Kalvachev  wrote:
> On 8/5/17, Rostislav Pehlivanov  wrote:
>> On 4 August 2017 at 23:58, Ivan Kalvachev  wrote:
>>
>>>
>>> >> I had it mixed before, but I changed it to be consistent.
>>> >> Some new avx instruction are not overloaded or
>>> >> available without their "v-" prefix.
>>> >> e.g. x86util uses vpbroadcastw in SPLATW.
>>> >
>>> > Every instruction that has both a legacy encoding and a VEX-encoding
>>> > will be automatically converted to VEX in AVX functions when written
>>> > without the v-prefix. There is no legacy encoding for newer
>>> > instructions so there's no reason for x86inc to overload those.
>>>
>>> That's great, but it doesn't address the original problem.
>>> Do you insist on removing the "v-" prefix from AVX only instructions?
>>>
>>>
>> I insist.
>
> If you insist, then you should have a good technical reason for it.
> Something that is not based on looks, cosmetics or "it has always been
> like this".
>
> For example, slow compilation because of unnecessary macro expansion is
> a technical reason to keep the "v-" prefix.
>

Consistency with the typical style of our codebase is also a reason.

- Hendrik
___
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
http://ffmpeg.org/mailman/listinfo/ffmpeg-devel


Re: [FFmpeg-devel] [PATCH]v6 Opus Pyramid Vector Quantization Search in x86 SIMD asm

2017-08-05 Thread Ivan Kalvachev
On 8/5/17, Rostislav Pehlivanov  wrote:
> On 4 August 2017 at 23:58, Ivan Kalvachev  wrote:
>
>>
>> >> I had it mixed before, but I changed it to be consistent.
>> >> Some new avx instruction are not overloaded or
>> >> available without their "v-" prefix.
>> >> e.g. x86util uses vpbroadcastw in SPLATW.
>> >
>> > Every instruction that has both a legacy encoding and a VEX-encoding
>> > will be automatically converted to VEX in AVX functions when written
>> > without the v-prefix. There is no legacy encoding for newer
>> > instructions so there's no reason for x86inc to overload those.
>>
>> That's great, but it doesn't address the original problem.
>> Do you insist on removing the "v-" prefix from AVX only instructions?
>>
>>
> I insist.

If you insist, then you should have a good technical reason for it.
Something that is not based on looks, cosmetics or "it has always been
like this".

For example, slow compilation because of unnecessary macro expansion is
a technical reason to keep the "v-" prefix.

;)
___
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
http://ffmpeg.org/mailman/listinfo/ffmpeg-devel


Re: [FFmpeg-devel] [PATCH]v6 Opus Pyramid Vector Quantization Search in x86 SIMD asm

2017-08-04 Thread Rostislav Pehlivanov
On 4 August 2017 at 23:58, Ivan Kalvachev  wrote:

>
> >> I had it mixed before, but I changed it to be consistent.
> >> Some new avx instruction are not overloaded or
> >> available without their "v-" prefix.
> >> e.g. x86util uses vpbroadcastw in SPLATW.
> >
> > Every instruction that has both a legacy encoding and a VEX-encoding
> > will be automatically converted to VEX in AVX functions when written
> > without the v-prefix. There is no legacy encoding for newer
> > instructions so there's no reason for x86inc to overload those.
>
> That's great, but it doesn't address the original problem.
> Do you insist on removing the "v-" prefix from AVX only instructions?
>
>
I insist.
___
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
http://ffmpeg.org/mailman/listinfo/ffmpeg-devel


Re: [FFmpeg-devel] [PATCH]v6 Opus Pyramid Vector Quantization Search in x86 SIMD asm

2017-08-04 Thread Ivan Kalvachev
On 8/4/17, Henrik Gramner  wrote:
> On Thu, Aug 3, 2017 at 11:36 PM, Ivan Kalvachev 
> wrote:
>>> 1234_1234_1234_123
>>> VBROADCASTSS ym1, xm1
>>> BLENDVPS  m1, m2, m3
>>>
>>> is the most commonly used alignment.
>>
>> I see that a lot of .asm files use different alignments.
>> I'll try to pick something similar that I find good looking.
>>
>> I also see that different function/macro can have
>> different position for the first comma.
>> This could be quite useful, to visually separate
>> the macros.
>
> Things like indentation can be a bit inconsistent at times, yes. Cody
> by different authors written over the span of many year etc.
>
> It's normal to diverge from "standard alignment" in cases when macro
> names (or memory addresses) are fairly long. Otherwise the amount of
> whitespace between instructions and operands would easily get too
> large to be practical.

I tried different things. The one space after comma indeed looks best.
I used 2 spaces after instruction.

>>> Now that I think about it I believe that part wasn't merged because
>>> smartalign is broken on some older nasm versions (which is another
>>> reason for avoiding things like this in individual files) and FFmpeg
>>> doesn't enforce any specific version requirements. I guess it could be
>>> worked around with some version checking ifdeffery if we know which
>>> versions are affected.
>>
>> I don't see anything relevant in here:
>> http://repo.or.cz/nasm.git/history/HEAD:/macros/smartalign.mac
>> I also didn't notice anything similar in the changelog.
> http://repo.or.cz/nasm.git/commit/f29123b936b1751c6d72dad86b05eb293cca5c6c
> https://bugzilla.nasm.us/show_bug.cgi?id=3392319

This would break generation of dependencies?
I think that FFmpeg uses combined mode with nasm,
so it should not call preprocessor only mode,
unless DBG=1 is used.

I'll investigate this further.
(revert the nasm fix, try building ffmpeg dependencies ...)

>> I had it mixed before, but I changed it to be consistent.
>> Some new avx instruction are not overloaded or
>> available without their "v-" prefix.
>> e.g. x86util uses vpbroadcastw in SPLATW.
>
> Every instruction that has both a legacy encoding and a VEX-encoding
> will be automatically converted to VEX in AVX functions when written
> without the v-prefix. There is no legacy encoding for newer
> instructions so there's no reason for x86inc to overload those.

That's great, but it doesn't address the original problem.
Do you insist on removing the "v-" prefix from AVX only instructions?

You said that this is purely cosmetic and
I think that it would make compilation a bit slower,
because of unnecessary macro expansions. ;)

>> Isn't one of the selling points for x86inc that
>> it would warn if unsupported instructions are used?
>
> Not really, no. It's done in a small subset of cases as a part of the
> aforementioned auto legacy/VEX conversion because it was trivial to
> add that functionality, but correctly tracking every single x86
> instruction in existence would be a whole different thing.
>
> Some instructions requires different instruction sets not just
> depending on the vector width but the operand types as well (for
> example, pextrw from mm to gpr requires mmx2, from xmm to gpr requires
> sse2, from xmm to memory requires sse4.1) and there's no generic way
> to handle every special corner case which would make things fairly
> complex. I'm guessing doing so would also cause a significant
> reduction in compilation speed which I'm sure some people would
> dislike.

There is already check for register size and integer operations,
because SSE1 does not have 128bit xmm integer operations.
It should be just few lines of code to add the same check for
AVX1 and 256bit ymm integer operations.
(I'm not confident enough to mess with that code, yet.)

As for the immediate task:
The "error" message stays.

Maybe I should also add "use PBROADCASTD instead" to it?


>>> Add your improvements to the existing x86util macro (can be done in
>>> the same patch) and then use that.
>>
>> FFmpeg guidelines request that it is a separate patch in separate mail.
>
> Doesn't really seem to be followed historically by looking at the git
> history, but doing things by the book is never wrong so making it a
> separate commit might be the correct choice.

Will do.

>> Just renaming it is not enough, because SPLATD
>> works differently (it can broadcast second, third or forth element too)
>> and it is already used in existing code.
>> Maybe PBROADCASTD macro could use SPLATD internally.
>
> SPLATD is hardcoded to broadcast the lowest element.
>
> http://git.videolan.org/?p=ffmpeg.git;a=blob;f=libavutil/x86/x86util.asm;h=cc7d272cada6d73e5ddc42ae359491086a206743;hb=HEAD#l782
>
> Ideally the SSE branch of SPLATD should be also removed and any float
> code using it should be changed to use the BROADCASTSS macro instead,
> but that's perhaps going a bit off-topic.


Re: [FFmpeg-devel] [PATCH]v6 Opus Pyramid Vector Quantization Search in x86 SIMD asm

2017-08-04 Thread Henrik Gramner
On Thu, Aug 3, 2017 at 11:36 PM, Ivan Kalvachev  wrote:
>> 1234_1234_1234_123
>> VBROADCASTSS ym1, xm1
>> BLENDVPS  m1, m2, m3
>>
>> is the most commonly used alignment.
>
> I see that a lot of .asm files use different alignments.
> I'll try to pick something similar that I find good looking.
>
> I also see that different function/macro can have
> different position for the first comma.
> This could be quite useful, to visually separate
> the macros.

Things like indentation can be a bit inconsistent at times, yes. Cody
by different authors written over the span of many year etc.

It's normal to diverge from "standard alignment" in cases when macro
names (or memory addresses) are fairly long. Otherwise the amount of
whitespace between instructions and operands would easily get too
large to be practical.

>> Now that I think about it I believe that part wasn't merged because
>> smartalign is broken on some older nasm versions (which is another
>> reason for avoiding things like this in individual files) and FFmpeg
>> doesn't enforce any specific version requirements. I guess it could be
>> worked around with some version checking ifdeffery if we know which
>> versions are affected.
>
> I don't see anything relevant in here:
> http://repo.or.cz/nasm.git/history/HEAD:/macros/smartalign.mac
> I also didn't notice anything similar in the changelog.

http://repo.or.cz/nasm.git/commit/f29123b936b1751c6d72dad86b05eb293cca5c6c
https://bugzilla.nasm.us/show_bug.cgi?id=3392319

> I had it mixed before, but I changed it to be consistent.
> Some new avx instruction are not overloaded or
> available without their "v-" prefix.
> e.g. x86util uses vpbroadcastw in SPLATW.

Every instruction that has both a legacy encoding and a VEX-encoding
will be automatically converted to VEX in AVX functions when written
without the v-prefix. There is no legacy encoding for newer
instructions so there's no reason for x86inc to overload those.

> Isn't one of the selling points for x86inc that
> it would warn if unsupported instructions are used?

Not really, no. It's done in a small subset of cases as a part of the
aforementioned auto legacy/VEX conversion because it was trivial to
add that functionality, but correctly tracking every single x86
instruction in existence would be a whole different thing.

Some instructions requires different instruction sets not just
depending on the vector width but the operand types as well (for
example, pextrw from mm to gpr requires mmx2, from xmm to gpr requires
sse2, from xmm to memory requires sse4.1) and there's no generic way
to handle every special corner case which would make things fairly
complex. I'm guessing doing so would also cause a significant
reduction in compilation speed which I'm sure some people would
dislike.

>> Add your improvements to the existing x86util macro (can be done in
>> the same patch) and then use that.
>
> FFmpeg guidelines request that it is a separate patch in separate mail.

Doesn't really seem to be followed historically by looking at the git
history, but doing things by the book is never wrong so making it a
separate commit might be the correct choice.

> Just renaming it is not enough, because SPLATD
> works differently (it can broadcast second, third or forth element too)
> and it is already used in existing code.
> Maybe PBROADCASTD macro could use SPLATD internally.

SPLATD is hardcoded to broadcast the lowest element.

http://git.videolan.org/?p=ffmpeg.git;a=blob;f=libavutil/x86/x86util.asm;h=cc7d272cada6d73e5ddc42ae359491086a206743;hb=HEAD#l782

Ideally the SSE branch of SPLATD should be also removed and any float
code using it should be changed to use the BROADCASTSS macro instead,
but that's perhaps going a bit off-topic.

> "cmpps" is not scalar, but it is handled by x86inc macros and converted
> to "vcmpps", so no SSE penalties are possible with it.
>
> Read again what I wrote previously.

'cmpps' is packed, not scalar (hence the 'p' instead of 's') so it
obviously shouldn't be used in scalar code. I believe you misread my
initial comment which said to use 'cmpss' instead of 'cmpless'.

>> One could of course argue whether or not x86inc should deal with
>> psuedo-instructions but right now it doesn't so it will cause AVX
>> state transitions if INIT_YMM is used since it wont be VEX-encoded.
>
> Let's say that using a number for selecting comparison mode
> is NOT developer friendly.

I did indeed say one can argue about it. I can see if there's any
clean way of handling it that doesn't require a hundred lines of code.

>> Doing PIC-mangling optimally will in many cases be context-dependent
>> which is probably why it doesn't already exist. For example if you
>> need to do multiple complex PIC-loads you should do a single lea and
>> reuse it which is harder to convey as a generic macro.
>
> I think that this macro can help with that too:
> e.g.
> SET_PIC_BASE lea,  r5, const_1
> movapsm1,  

Re: [FFmpeg-devel] [PATCH]v6 Opus Pyramid Vector Quantization Search in x86 SIMD asm

2017-08-03 Thread Ivan Kalvachev
On 8/2/17, Henrik Gramner  wrote:
> On Tue, Aug 1, 2017 at 11:46 PM, Ivan Kalvachev 
> wrote:
>> On 7/31/17, Henrik Gramner  wrote:
>>> Use rN instead of rNq for numbered registers (q suffix is used for
>>> named args only due to preprocessor limitations).
>>
>> Is this documented?
>
> Not sure, but there's probably a bunch small things like this that
> aren't really well documented.
>
>>> Use the same "standard" vertical alignment rules as most existing
>>> code, e.g. instructions indented by 4 spaces and operands aligned on
>>> the first comma.
>>
>> What do you mean operands aligned on the first comma?
>> The longest operand I had is "xmm0" , counting comma and space
>> I get 6 position alignment for operands.
>> Now, with "xm7" i can lower this to 5 position. Should I do that?
>> (I don't have "xm15" ).
>>
>
> 1234_1234_1234_123
> VBROADCASTSS ym1, xm1
> BLENDVPS  m1, m2, m3
>
> is the most commonly used alignment.

I see that a lot of .asm files use different alignments.
I'll try to pick something similar that I find good looking.

I also see that different function/macro can have
different position for the first comma.
This could be quite useful, to visually separate
the macros.


>>> Unless aligning every single loop entry helps a lot I'd avoid it since
>>> it does waste a bit of icache.
>>
>> I'll redo my benchmarks, but I have placed these aligns for a reason.
>> At some point removed debug instructions started making my code
>> slower. So I've placed align to avoid random slowdowns.
>
> Ok.
>
>>> Explicitly using the carry flag as a branch condition is a bit weird.
>>> Generally using jg/jl/jge/jle tends to be clearer.
>>
>> I use it to take advantage of the so called MacroFusion.
>> It allows the merge of cmp and jxx, as long as the branch
>> checks only one flag, so all signed branches are to be avoided
>> (as stated by intel manual).
>> The newer the cpu, the more opcodes (add/sub)
>> could be merged and less limitations.
>
> Every µarch that can do macro-op fusion with add/sub can do so with
> both signed and unsigned branch conditions.
> There's actually only a single µarch that can fuse cmp with unsigned
> but not signed branch conditions and that's more than a decade old
> (Conroe), but if you want to check if (unsigned)a < (unsigned)b 'jb'
> is preferred of 'jc' (both produce the exact same opcode).

One is enough reason for me :}
Since the same thing works just as well or better on newer CPU's.

Also, using above/bellow on arithmetic operations
is a lot more confusing than carry/borrow.
You can see that I use "jc" and "jnc" after "sub".

I'll keep it as it is, unless you have solid reason to change it.

>>> Assembler-specific ifdeffery should be avoided in individual files.
>>> Something equivalent already exists in x86inc actually but I don't
>>> remember if it was merged to FFmpeg from upstream (x264) yet.
>>
>> Definitely not merged.
>>
>> I hear a lot about the improved x86inc,
>> maybe it is time for you to merge it in FFmpeg?
>
> Now that I think about it I believe that part wasn't merged because
> smartalign is broken on some older nasm versions (which is another
> reason for avoiding things like this in individual files) and FFmpeg
> doesn't enforce any specific version requirements. I guess it could be
> worked around with some version checking ifdeffery if we know which
> versions are affected.

I don't see anything relevant in here:
http://repo.or.cz/nasm.git/history/HEAD:/macros/smartalign.mac
I also didn't notice anything similar in the changelog.

I don't think this is relevant reason for not merging avx2 support. ;)

>>> Isn't this just an overly complicated way of expressing "dd 0*4, 1*4,
>>> 2*4, 3*4, 4*4, 5*4, 6*4, 7*4"?
>>
>> Yes.
>> Do you know a way that works with "times 8"?
>> I've done it this way to make it easy to change the size of the constant.
>>
>> Do you request I change it?
>
> I'd prefer just using that simple one-liner I posted. Replacing the
> numbers if you want to change things later is trivial.

It might be trivial, but it also makes it easier to misspell something.
I'd prefer to keep it as it is.

>>> Is reordering moves for the non-AVX path worth the additional
>>> complexity? Microoptimizations that only affect legacy hardware are
>>> IMO a bit questionable.
>>
>> Yes, I'm on "legacy" hardware and the improvement is reliably measurable.
>
> Ok, in that case you should also use shufps/addps instead of
> vshufps/vaddps in the AVX branch for cosmetic reasons though (x86inc
> performs automatic VEX-encoding).

Only on instructions it knows...
that brings us again to the AVX2.

I had it mixed before, but I changed it to be consistent.
Some new avx instruction are not overloaded or
available without their "v-" prefix.
e.g. x86util uses vpbroadcastw in SPLATW.

It makes more sense to use "v" prefix for
code that is guaranteed to be avx.

>>> The AVX1 integer warning is 

Re: [FFmpeg-devel] [PATCH]v6 Opus Pyramid Vector Quantization Search in x86 SIMD asm

2017-08-02 Thread Henrik Gramner
On Tue, Aug 1, 2017 at 11:46 PM, Ivan Kalvachev  wrote:
> On 7/31/17, Henrik Gramner  wrote:
>> Use rN instead of rNq for numbered registers (q suffix is used for
>> named args only due to preprocessor limitations).
>
> Is this documented?

Not sure, but there's probably a bunch small things like this that
aren't really well documented.

>> Use the same "standard" vertical alignment rules as most existing
>> code, e.g. instructions indented by 4 spaces and operands aligned on
>> the first comma.
>
> What do you mean operands aligned on the first comma?
> The longest operand I had is "xmm0" , counting comma and space
> I get 6 position alignment for operands.
> Now, with "xm7" i can lower this to 5 position. Should I do that?
> (I don't have "xm15" ).
>

1234_1234_1234_123
VBROADCASTSS ym1, xm1
BLENDVPS  m1, m2, m3

is the most commonly used alignment.

>> Unless aligning every single loop entry helps a lot I'd avoid it since
>> it does waste a bit of icache.
>
> I'll redo my benchmarks, but I have placed these aligns for a reason.
> At some point removed debug instructions started making my code
> slower. So I've placed align to avoid random slowdowns.

Ok.

>> Explicitly using the carry flag as a branch condition is a bit weird.
>> Generally using jg/jl/jge/jle tends to be clearer.
>
> I use it to take advantage of the so called MacroFusion.
> It allows the merge of cmp and jxx, as long as the branch
> checks only one flag, so all signed branches are to be avoided
> (as stated by intel manual).
> The newer the cpu, the more opcodes (add/sub)
> could be merged and less limitations.

Every µarch that can do macro-op fusion with add/sub can do so with
both signed and unsigned branch conditions.

There's actually only a single µarch that can fuse cmp with unsigned
but not signed branch conditions and that's more than a decade old
(Conroe), but if you want to check if (unsigned)a < (unsigned)b 'jb'
is preferred of 'jc' (both produce the exact same opcode).

>> Assembler-specific ifdeffery should be avoided in individual files.
>> Something equivalent already exists in x86inc actually but I don't
>> remember if it was merged to FFmpeg from upstream (x264) yet.
>
> Definitely not merged.
>
> I hear a lot about the improved x86inc,
> maybe it is time for you to merge it in FFmpeg?

Now that I think about it I believe that part wasn't merged because
smartalign is broken on some older nasm versions (which is another
reason for avoiding things like this in individual files) and FFmpeg
doesn't enforce any specific version requirements. I guess it could be
worked around with some version checking ifdeffery if we know which
versions are affected.

>> Isn't this just an overly complicated way of expressing "dd 0*4, 1*4,
>> 2*4, 3*4, 4*4, 5*4, 6*4, 7*4"?
>
> Yes.
> Do you know a way that works with "times 8"?
> I've done it this way to make it easy to change the size of the constant.
>
> Do you request I change it?

I'd prefer just using that simple one-liner I posted. Replacing the
numbers if you want to change things later is trivial.

>> Is reordering moves for the non-AVX path worth the additional
>> complexity? Microoptimizations that only affect legacy hardware are
>> IMO a bit questionable.
>
> Yes, I'm on "legacy" hardware and the improvement is reliably measurable.

Ok, in that case you should also use shufps/addps instead of
vshufps/vaddps in the AVX branch for cosmetic reasons though (x86inc
performs automatic VEX-encoding).

>> The AVX1 integer warning is kind of silly and doesn't really serve any
>> purpose.
>
> As I've said in this thread before,
> If you use avx1 ymm with this macro
> you won't get any warning that avx2 opcode has been generated
> because x86inc does not overload avx2.

Sure, but using any other AVX2 instruction in AVX functions wont
result in any warnings either so it doesn't make much sense to do it
in this specific case.

>> Use the existing x86util VBROADCASTSS macro. Modify it if it's lacking
>> something.
>
> The x86util VBROADCASTSS macro implements only
> the avx1 variant that is memory to register.
> My variant emulates the avx2 variant,
> that also does reg to reg.
>
> My variant also avoids write dependency on "movss reg, reg".
> ("movss reg, mem" clears the other elements, but
> "movss reg, reg" preserves them. This creates dependency on the old
> values of the register.)
>
> And yes, I did ask if I should separate
> these macros in another file and
> if I should try to merge them in x86util.asm first.
>
> What do you think on the matter?

Add your improvements to the existing x86util macro (can be done in
the same patch) and then use that.

>> +%macro EMU_pbroadcastd 2 ; dst, src
>> +%if cpuflag(avx2)
>> +   vpbroadcastd %1,   %2
>> +%elif cpuflag(avx) && mmsize >= 32
>> +%error AVX1 does not support integer 256 bit ymm operations
>> +%else
>> +  %ifnum sizeof%2   ; sse2 register
>> +

Re: [FFmpeg-devel] [PATCH]v6 Opus Pyramid Vector Quantization Search in x86 SIMD asm

2017-08-01 Thread Ivan Kalvachev
On 7/31/17, Henrik Gramner  wrote:
> On Wed, Jul 26, 2017 at 4:56 PM, Ivan Kalvachev 
> wrote:
>> +++ b/libavcodec/x86/opus_pvq_search.asm
>
> Generic minor stuff:
>
> Use rN instead of rNq for numbered registers (q suffix is used for
> named args only due to preprocessor limitations).

Done.

Is this documented?

> Use the same "standard" vertical alignment rules as most existing
> code, e.g. instructions indented by 4 spaces and operands aligned on
> the first comma.

What do you mean operands aligned on the first comma?
The longest operand I had is "xmm0" , counting comma and space
I get 6 position alignment for operands.
Now, with "xm7" i can lower this to 5 position. Should I do that?
(I don't have "xm15" ).

Also, I've aligned the operand at 12 positions after their
instruction start, because this is the size of the longest
real instruction.

As for why i have instruction start at 8'th position.
It's because I've allocated the field at position 4-7
for instruction prefixes, and the "EMU_" is 4 positions.

Now I have:
1234567812345_12345_12345_
EMU_broadcastss ym13, xm10
EMU_blendvpsxm1,  xm2,  m3
   vblendvps
blendvps

I can ditch the prefix and shorten the operands. e.g. :
1234_1234_1234_1234_
VBROADCASTSS ym1, xm1
BLENDVPS m1,  m2,  m3

Is that acceptable? Or maybe you do mean:

1234_1234_1234_123
VBROADCASTSS ym1, xm1
BLENDVPS  m1, m2, m3

However I would prefer to use
1234_1234_1234_123
VBROADCASTSS ym1, xm1
BLENDVPS  m1,  m2,  m3
PBLENDVD xm1, xm2, xm3

(You do need fixed width font to see that correctly).

I'll wait for reply before doing that.

> Use xmN instead of xmmN (only really makes a difference when SWAP:s
> are used, but might as well do it "correctly").

Done.

> Use 32-bit operands, e.g. rNd, when 64-bit math isn't required.

Done

> Unless aligning every single loop entry helps a lot I'd avoid it since
> it does waste a bit of icache.

I'll redo my benchmarks, but I have placed these aligns for a reason.
At some point removed debug instructions started making my code
slower. So I've placed align to avoid random slowdowns.

> Explicitly using the carry flag as a branch condition is a bit weird.
> Generally using jg/jl/jge/jle tends to be clearer.

I use it to take advantage of the so called MacroFusion.
It allows the merge of cmp and jxx, as long as the branch
checks only one flag, so all signed branches are to be avoided
(as stated by intel manual).
The newer the cpu, the more opcodes (add/sub)
could be merged and less limitations.

>> +%ifdef __NASM_VER__
>> +%use "smartalign"
>> +ALIGNMODE p6
>> +%endif
>
> Assembler-specific ifdeffery should be avoided in individual files.
> Something equivalent already exists in x86inc actually but I don't
> remember if it was merged to FFmpeg from upstream (x264) yet.

Definitely not merged.

I hear a lot about the improved x86inc,
maybe it is time for you to merge it in FFmpeg?


>> +const_int32_offsets:
>> +%rep 8
>> +dd $-const_int32_offsets
>> +%endrep
>
> Isn't this just an overly complicated way of expressing "dd 0*4, 1*4,
> 2*4, 3*4, 4*4, 5*4, 6*4, 7*4"?

Yes.
Do you know a way that works with "times 8"?
I've done it this way to make it easy to change the size of the constant.

Do you request I change it?

>> +SECTION .text
>> +
>> +
>> +
>> +
>
> Reduce some of the excessive whitespace.

Will do with the other cosmetics.

>> +%macro HSUMPS 2 ; dst/src, tmp
>> +%if cpuflag(avx)
>> +  %if sizeof%1>=32  ; avx
>> +   vperm2f128   %2,   %1,   %1,   (0)*16+(1)   ;
>> %2=lo(%1)<<128+hi(%1)
>> +   vaddps   %1,   %2
>> +  %endif
>> +   vshufps  %2,   %1,   %1,   q1032
>> +   vaddps   %1,   %2
>> +   vshufps  %2,   %1,   %1,   q0321
>> +   vaddps   %1,   %2
>> +
>> +%else  ; this form is a bit faster than the short avx-like emulation.
>> +movaps  %2,   %1;[d,   c,   b,
>> a   ]
>> +shufps  %1,   %1,   q1032   ; %2=[b,   a,   d,
>> c   ]
>> +addps   %1,   %2; %1=[b+d, a+c, d+b,
>> c+a ]
>> +movaps  %2,   %1
>> +shufps  %1,   %1,   q0321   ; %2=[c+a, b+d, a+c,
>> d+b ]
>> +addps   %1,   %2; %1=[c+a+b+d, b+d+a+c, a+c+d+b,
>> d+b+c+a ]
>> +; all %1 members should be equal for as long as float a+b==b+a
>> +%endif
>> +%endmacro
>
> Is reordering moves for the non-AVX path worth the additional
> complexity? Microoptimizations that only affect legacy hardware are
> IMO a bit questionable.

Yes, I'm on "legacy" hardware and the improvement is reliably measurable.

>> +%macro EMU_pblendvb 3 ; dst/src_a, src_b, mask
>> +%if cpuflag(avx)
>> +  %if cpuflag(avx) && notcpuflag(avx2) && sizeof%1 >= 32
>> +%error 

Re: [FFmpeg-devel] [PATCH]v6 Opus Pyramid Vector Quantization Search in x86 SIMD asm

2017-07-31 Thread Henrik Gramner
On Wed, Jul 26, 2017 at 4:56 PM, Ivan Kalvachev  wrote:
> +++ b/libavcodec/x86/opus_pvq_search.asm

Generic minor stuff:

Use rN instead of rNq for numbered registers (q suffix is used for
named args only due to preprocessor limitations).

Use the same "standard" vertical alignment rules as most existing
code, e.g. instructions indented by 4 spaces and operands aligned on
the first comma.

Use xmN instead of xmmN (only really makes a difference when SWAP:s
are used, but might as well do it "correctly").

Use 32-bit operands, e.g. rNd, when 64-bit math isn't required.

Unless aligning every single loop entry helps a lot I'd avoid it since
it does waste a bit of icache.

Explicitly using the carry flag as a branch condition is a bit weird.
Generally using jg/jl/jge/jle tends to be clearer.

> +%ifdef __NASM_VER__
> +%use "smartalign"
> +ALIGNMODE p6
> +%endif

Assembler-specific ifdeffery should be avoided in individual files.
Something equivalent already exists in x86inc actually but I don't
remember if it was merged to FFmpeg from upstream (x264) yet.

> +const_int32_offsets:
> +%rep 8
> +dd $-const_int32_offsets
> +%endrep

Isn't this just an overly complicated way of expressing "dd 0*4, 1*4,
2*4, 3*4, 4*4, 5*4, 6*4, 7*4"?

> +SECTION .text
> +
> +
> +
> +

Reduce some of the excessive whitespace.

> +%macro HSUMPS 2 ; dst/src, tmp
> +%if cpuflag(avx)
> +  %if sizeof%1>=32  ; avx
> +   vperm2f128   %2,   %1,   %1,   (0)*16+(1)   ; %2=lo(%1)<<128+hi(%1)
> +   vaddps   %1,   %2
> +  %endif
> +   vshufps  %2,   %1,   %1,   q1032
> +   vaddps   %1,   %2
> +   vshufps  %2,   %1,   %1,   q0321
> +   vaddps   %1,   %2
> +
> +%else  ; this form is a bit faster than the short avx-like emulation.
> +movaps  %2,   %1;[d,   c,   b,   a   
> ]
> +shufps  %1,   %1,   q1032   ; %2=[b,   a,   d,   c   
> ]
> +addps   %1,   %2; %1=[b+d, a+c, d+b, c+a 
> ]
> +movaps  %2,   %1
> +shufps  %1,   %1,   q0321   ; %2=[c+a, b+d, a+c, d+b 
> ]
> +addps   %1,   %2; %1=[c+a+b+d, b+d+a+c, a+c+d+b, 
> d+b+c+a ]
> +; all %1 members should be equal for as long as float a+b==b+a
> +%endif
> +%endmacro

Is reordering moves for the non-AVX path worth the additional
complexity? Microoptimizations that only affect legacy hardware are
IMO a bit questionable.

> +%macro EMU_pblendvb 3 ; dst/src_a, src_b, mask
> +%if cpuflag(avx)
> +  %if cpuflag(avx) && notcpuflag(avx2) && sizeof%1 >= 32
> +%error AVX1 does not support integer 256 bit ymm operations
> +  %endif
> +
> +   vpblendvb%1,   %1,   %2,   %3
> +;-
> +%elif cpuflag(sse4)
> +%ifnidn %3,xmm0
> +  %error sse41 blendvps uses xmm0 as default 3 operand, you used %3
> +%endif
> +pblendvb%1,   %2,   %3
> +;-
> +%else
> +pxor%2,   %1
> +pand%2,   %3
> +pxor%1,   %2
> +%endif
> +%endmacro

The AVX1 integer warning is kind of silly and doesn't really serve any purpose.

> +%macro EMU_broadcastss 2 ; dst, src
> +%if cpuflag(avx2)
> +   vbroadcastss %1,   %2; ymm, xmm
> +;-
> +%elif cpuflag(avx)
> +  %ifnum sizeof%2   ; avx1 register
> +   vpermilpsxmm%1, xmm%2, q ; xmm, xmm, imm || ymm, ymm, imm
> +%if sizeof%1 >= 32  ; mmsize>=32
> +   vinsertf128  %1,   %1,xmm%1,   1 ; ymm, ymm, xmm, im
> +%endif
> +  %else ; avx1 memory
> +   vbroadcastss %1,   %2; ymm, mm32 || xmm, mm32
> +  %endif
> +;-
> +%else
> +  %ifnum sizeof%2   ; sse register
> +shufps  %1,   %2,   %2,   q
> +  %else ; sse memory
> +movss   %1,   %2; this zeroes the other 3 elements
> +shufps  %1,   %1,   0
> +  %endif
> +%endif
> +%endmacro

Use the existing x86util VBROADCASTSS macro. Modify it if it's lacking
something.

+%macro EMU_pbroadcastd 2 ; dst, src
+%if cpuflag(avx2)
+   vpbroadcastd %1,   %2
+%elif cpuflag(avx) && mmsize >= 32
+%error AVX1 does not support integer 256 bit ymm operations
+%else
+  %ifnum sizeof%2   ; sse2 register
+pshufd  %1,   %2,   q
+  %else ; sse memory
+movd%1,   %2; movd zero extends
+pshufd  %1,   %1,   0
+  %endif
+%endif
+%endmacro

Same as above, but using the SPLATD macro.

> +; Merge parallel maximums final round (1 vs 1)
> +movaps  xmm0, xmm3  ; m0 = m3[0] = p[0]
> +shufps  xmm3, xmm3, q   ; m3 = m3[1] = p[1]
> +cmpless xmm0, xmm3

Use 3-arg instead of reg-reg movaps.

Use the actual cmpss instruction instead 

Re: [FFmpeg-devel] [PATCH]v6 Opus Pyramid Vector Quantization Search in x86 SIMD asm

2017-07-27 Thread Ivan Kalvachev
On 7/27/17, Rostislav Pehlivanov  wrote:
> On 27 July 2017 at 09:38, Ivan Kalvachev  wrote:
>
>> On 7/27/17, Rostislav Pehlivanov  wrote:
>> > On 26 July 2017 at 15:56, Ivan Kalvachev  wrote:
>> >
>> >> +if (ARCH_X86 && CONFIG_OPUS_ENCODER)
>> >> +ff_opus_dsp_init_x86(s);
>> >>
>> >
>> > Just change it to
>> > +if (ARCH_X86)
>> >
>> > The init function is named opus_dsp, so it'll get used to other opus
>> > things, not just the encoder.
>>
>> But at the moment it does not.
>> I do prefer to leave that task for the one that
>> adds opus decoder functions.
>>
>> Also this change alone would break compilation, since
>> it also requires changing the libavcodec/x86/Makefile
>> and adding the guard inside the opus_dsp_init.c
>>
>> Another option is to have "opus_enc_dsp_init.c" and call
>> the function "ff_opus_enc_dsp_init_x86()".
>>
>> Do tell me which option do you prefer
>> and do you insist on v7 just for that.
>>
>> > The assembly code looks fine to me, but other people will have to take a
>> > look at it in case I'm missing something.
>
> The former, but that can be changed later after pushing

Here is the patch.
I'll merge it in v7, if there is one.

Please note that makefile needs to use two separate
config_opus_decoder/encoder, since there is no config_opus_codec . All
other dsp seem to use separate files for encoder and decoder dsp.

Best Regards.
From 1a6ee9b2880c67db25737a6317f09cbbac441c83 Mon Sep 17 00:00:00 2001
From: Ivan Kalvachev 
Date: Thu, 27 Jul 2017 14:21:33 +0300
Subject: [PATCH 2/6] Build opus dsp for encoder and decoder.

---
 libavcodec/opus_pvq.c  | 2 +-
 libavcodec/x86/Makefile| 1 +
 libavcodec/x86/opus_dsp_init.c | 2 ++
 3 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/libavcodec/opus_pvq.c b/libavcodec/opus_pvq.c
index 3aa502929c..2fb276099b 100644
--- a/libavcodec/opus_pvq.c
+++ b/libavcodec/opus_pvq.c
@@ -947,7 +947,7 @@ int av_cold ff_celt_pvq_init(CeltPVQ **pvq)
 s->encode_band= pvq_encode_band;
 s->band_cost  = pvq_band_cost;
 
-if (ARCH_X86 && CONFIG_OPUS_ENCODER)
+if (ARCH_X86)
 ff_opus_dsp_init_x86(s);
 
 *pvq = s;
diff --git a/libavcodec/x86/Makefile b/libavcodec/x86/Makefile
index 9875f48797..e36644c72a 100644
--- a/libavcodec/x86/Makefile
+++ b/libavcodec/x86/Makefile
@@ -52,6 +52,7 @@ OBJS-$(CONFIG_APNG_DECODER)+= x86/pngdsp_init.o
 OBJS-$(CONFIG_CAVS_DECODER)+= x86/cavsdsp.o
 OBJS-$(CONFIG_DCA_DECODER) += x86/dcadsp_init.o x86/synth_filter_init.o
 OBJS-$(CONFIG_DNXHD_ENCODER)   += x86/dnxhdenc_init.o
+OBJS-$(CONFIG_OPUS_DECODER)+= x86/opus_dsp_init.o
 OBJS-$(CONFIG_OPUS_ENCODER)+= x86/opus_dsp_init.o
 OBJS-$(CONFIG_HEVC_DECODER)+= x86/hevcdsp_init.o
 OBJS-$(CONFIG_JPEG2000_DECODER)+= x86/jpeg2000dsp_init.o
diff --git a/libavcodec/x86/opus_dsp_init.c b/libavcodec/x86/opus_dsp_init.c
index f4c25822db..c51f786ee8 100644
--- a/libavcodec/x86/opus_dsp_init.c
+++ b/libavcodec/x86/opus_dsp_init.c
@@ -32,6 +32,7 @@ av_cold void ff_opus_dsp_init_x86(CeltPVQ *s)
 {
 int cpu_flags = av_get_cpu_flags();
 
+#if CONFIG_OPUS_ENCODER
 if (EXTERNAL_SSE2(cpu_flags))
 s->pvq_search = ff_pvq_search_sse2;
 
@@ -40,4 +41,5 @@ av_cold void ff_opus_dsp_init_x86(CeltPVQ *s)
 
 if (EXTERNAL_AVX(cpu_flags))
 s->pvq_search = ff_pvq_search_avx;
+#endif
 }
-- 
2.13.2

___
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
http://ffmpeg.org/mailman/listinfo/ffmpeg-devel


Re: [FFmpeg-devel] [PATCH]v6 Opus Pyramid Vector Quantization Search in x86 SIMD asm

2017-07-27 Thread Rostislav Pehlivanov
On 27 July 2017 at 09:38, Ivan Kalvachev  wrote:

> On 7/27/17, Rostislav Pehlivanov  wrote:
> > On 26 July 2017 at 15:56, Ivan Kalvachev  wrote:
> >
> >> +if (ARCH_X86 && CONFIG_OPUS_ENCODER)
> >> +ff_opus_dsp_init_x86(s);
> >>
> >
> > Just change it to
> > +if (ARCH_X86)
> >
> > The init function is named opus_dsp, so it'll get used to other opus
> > things, not just the encoder.
>
> But at the moment it does not.
> I do prefer to leave that task for the one that
> adds opus decoder functions.
>
> Also this change alone would break compilation, since
> it also requires changing the libavcodec/x86/Makefile
> and adding the guard inside the opus_dsp_init.c
>
> Another option is to have "opus_enc_dsp_init.c" and call
> the function "ff_opus_enc_dsp_init_x86()".
>
> Do tell me which option do you prefer
> and do you insist on v7 just for that.
>
> > The assembly code looks fine to me, but other people will have to take a
> > look at it in case I'm missing something.
> ___
> ffmpeg-devel mailing list
> ffmpeg-devel@ffmpeg.org
> http://ffmpeg.org/mailman/listinfo/ffmpeg-devel
>

The former, but that can be changed later after pushing
___
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
http://ffmpeg.org/mailman/listinfo/ffmpeg-devel


Re: [FFmpeg-devel] [PATCH]v6 Opus Pyramid Vector Quantization Search in x86 SIMD asm

2017-07-27 Thread Ivan Kalvachev
On 7/27/17, Rostislav Pehlivanov  wrote:
> On 26 July 2017 at 15:56, Ivan Kalvachev  wrote:
>
>> +if (ARCH_X86 && CONFIG_OPUS_ENCODER)
>> +ff_opus_dsp_init_x86(s);
>>
>
> Just change it to
> +if (ARCH_X86)
>
> The init function is named opus_dsp, so it'll get used to other opus
> things, not just the encoder.

But at the moment it does not.
I do prefer to leave that task for the one that
adds opus decoder functions.

Also this change alone would break compilation, since
it also requires changing the libavcodec/x86/Makefile
and adding the guard inside the opus_dsp_init.c

Another option is to have "opus_enc_dsp_init.c" and call
the function "ff_opus_enc_dsp_init_x86()".

Do tell me which option do you prefer
and do you insist on v7 just for that.

> The assembly code looks fine to me, but other people will have to take a
> look at it in case I'm missing something.
___
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
http://ffmpeg.org/mailman/listinfo/ffmpeg-devel


Re: [FFmpeg-devel] [PATCH]v6 Opus Pyramid Vector Quantization Search in x86 SIMD asm

2017-07-26 Thread Rostislav Pehlivanov
On 26 July 2017 at 15:56, Ivan Kalvachev  wrote:

> +if (ARCH_X86 && CONFIG_OPUS_ENCODER)
> +ff_opus_dsp_init_x86(s);
>

Just change it to
+if (ARCH_X86)

The init function is named opus_dsp, so it'll get used to other opus
things, not just the encoder.


The assembly code looks fine to me, but other people will have to take a
look at it in case I'm missing something.
___
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
http://ffmpeg.org/mailman/listinfo/ffmpeg-devel