Re: [FFmpeg-devel] [PATCH]v6 Opus Pyramid Vector Quantization Search in x86 SIMD asm
On Sat, Aug 5, 2017 at 12:58 AM, Ivan Kalvachevwrote: > 8 packed, 8 scalar. > > Unless I miss something (and as I've said before, > I'm not confident enough to mess with that code.) > > (AVX does extend to 32 variants, but they are not > SSE compatible, so no need to emulate them.) Oh, right. I quickly glanced at the docs and saw 32 pseudo-ops for each instruction for a total of 128 when adding pd, ps, sd, ss, but the fact that only the first 8 is relevant here reduces it to 32 which is a lot more manageable. > movaps m1, [WRT_PIC_BASE + const_2 + r2 ] > > Looks better. (Also not tested. Will do, later.) I intentionally used the WRT define at the end because that's most similar to the built in wrt syntax used when accessing symbols through the PLT or GOT, e.g. mov eax, [external_symbol wrt ..got] > Yeh $$ is the start of the current section, and that's is going to be > ".text" not "rodata". Obviously, yes. You need a reference that results in a compile-time constant PC-offset (which .rodata isn't) to create PC-relative relocation records to external symbols. ___ ffmpeg-devel mailing list ffmpeg-devel@ffmpeg.org http://ffmpeg.org/mailman/listinfo/ffmpeg-devel
Re: [FFmpeg-devel] [PATCH]v6 Opus Pyramid Vector Quantization Search in x86 SIMD asm
On 8/5/17, Hendrik Leppkeswrote: > On Sat, Aug 5, 2017 at 12:21 PM, Ivan Kalvachev > wrote: >> On 8/5/17, Rostislav Pehlivanov wrote: >>> On 4 August 2017 at 23:58, Ivan Kalvachev wrote: >>> >> I had it mixed before, but I changed it to be consistent. >> Some new avx instruction are not overloaded or >> available without their "v-" prefix. >> e.g. x86util uses vpbroadcastw in SPLATW. > > Every instruction that has both a legacy encoding and a VEX-encoding > will be automatically converted to VEX in AVX functions when written > without the v-prefix. There is no legacy encoding for newer > instructions so there's no reason for x86inc to overload those. That's great, but it doesn't address the original problem. Do you insist on removing the "v-" prefix from AVX only instructions? >>> I insist. >> >> If you insist, then you should have a good technical reason for it. >> Something that is not based on looks, cosmetics or "it has always been >> like this". >> >> For example, slow compilation because of unnecessary macro expansion is >> a technical reason to keep the "v-" prefix. >> > > Consistency with the typical style of our codebase is also a reason. No, it is not. This fall in category of doing things wrong, because we have always done them wrong. Also you brought up the matter of compilation speed, now you don't care about it? ;) Don't worry, I will do the requested change, only because in future it might help with avx1 & ymm integer test. Please answer my other question(s). ___ ffmpeg-devel mailing list ffmpeg-devel@ffmpeg.org http://ffmpeg.org/mailman/listinfo/ffmpeg-devel
Re: [FFmpeg-devel] [PATCH]v6 Opus Pyramid Vector Quantization Search in x86 SIMD asm
On Sat, Aug 5, 2017 at 12:21 PM, Ivan Kalvachevwrote: > On 8/5/17, Rostislav Pehlivanov wrote: >> On 4 August 2017 at 23:58, Ivan Kalvachev wrote: >> >>> >>> >> I had it mixed before, but I changed it to be consistent. >>> >> Some new avx instruction are not overloaded or >>> >> available without their "v-" prefix. >>> >> e.g. x86util uses vpbroadcastw in SPLATW. >>> > >>> > Every instruction that has both a legacy encoding and a VEX-encoding >>> > will be automatically converted to VEX in AVX functions when written >>> > without the v-prefix. There is no legacy encoding for newer >>> > instructions so there's no reason for x86inc to overload those. >>> >>> That's great, but it doesn't address the original problem. >>> Do you insist on removing the "v-" prefix from AVX only instructions? >>> >>> >> I insist. > > If you insist, then you should have a good technical reason for it. > Something that is not based on looks, cosmetics or "it has always been > like this". > > For example, slow compilation because of unnecessary macro expansion is > a technical reason to keep the "v-" prefix. > Consistency with the typical style of our codebase is also a reason. - Hendrik ___ ffmpeg-devel mailing list ffmpeg-devel@ffmpeg.org http://ffmpeg.org/mailman/listinfo/ffmpeg-devel
Re: [FFmpeg-devel] [PATCH]v6 Opus Pyramid Vector Quantization Search in x86 SIMD asm
On 8/5/17, Rostislav Pehlivanovwrote: > On 4 August 2017 at 23:58, Ivan Kalvachev wrote: > >> >> >> I had it mixed before, but I changed it to be consistent. >> >> Some new avx instruction are not overloaded or >> >> available without their "v-" prefix. >> >> e.g. x86util uses vpbroadcastw in SPLATW. >> > >> > Every instruction that has both a legacy encoding and a VEX-encoding >> > will be automatically converted to VEX in AVX functions when written >> > without the v-prefix. There is no legacy encoding for newer >> > instructions so there's no reason for x86inc to overload those. >> >> That's great, but it doesn't address the original problem. >> Do you insist on removing the "v-" prefix from AVX only instructions? >> >> > I insist. If you insist, then you should have a good technical reason for it. Something that is not based on looks, cosmetics or "it has always been like this". For example, slow compilation because of unnecessary macro expansion is a technical reason to keep the "v-" prefix. ;) ___ ffmpeg-devel mailing list ffmpeg-devel@ffmpeg.org http://ffmpeg.org/mailman/listinfo/ffmpeg-devel
Re: [FFmpeg-devel] [PATCH]v6 Opus Pyramid Vector Quantization Search in x86 SIMD asm
On 4 August 2017 at 23:58, Ivan Kalvachevwrote: > > >> I had it mixed before, but I changed it to be consistent. > >> Some new avx instruction are not overloaded or > >> available without their "v-" prefix. > >> e.g. x86util uses vpbroadcastw in SPLATW. > > > > Every instruction that has both a legacy encoding and a VEX-encoding > > will be automatically converted to VEX in AVX functions when written > > without the v-prefix. There is no legacy encoding for newer > > instructions so there's no reason for x86inc to overload those. > > That's great, but it doesn't address the original problem. > Do you insist on removing the "v-" prefix from AVX only instructions? > > I insist. ___ ffmpeg-devel mailing list ffmpeg-devel@ffmpeg.org http://ffmpeg.org/mailman/listinfo/ffmpeg-devel
Re: [FFmpeg-devel] [PATCH]v6 Opus Pyramid Vector Quantization Search in x86 SIMD asm
On 8/4/17, Henrik Gramnerwrote: > On Thu, Aug 3, 2017 at 11:36 PM, Ivan Kalvachev > wrote: >>> 1234_1234_1234_123 >>> VBROADCASTSS ym1, xm1 >>> BLENDVPS m1, m2, m3 >>> >>> is the most commonly used alignment. >> >> I see that a lot of .asm files use different alignments. >> I'll try to pick something similar that I find good looking. >> >> I also see that different function/macro can have >> different position for the first comma. >> This could be quite useful, to visually separate >> the macros. > > Things like indentation can be a bit inconsistent at times, yes. Cody > by different authors written over the span of many year etc. > > It's normal to diverge from "standard alignment" in cases when macro > names (or memory addresses) are fairly long. Otherwise the amount of > whitespace between instructions and operands would easily get too > large to be practical. I tried different things. The one space after comma indeed looks best. I used 2 spaces after instruction. >>> Now that I think about it I believe that part wasn't merged because >>> smartalign is broken on some older nasm versions (which is another >>> reason for avoiding things like this in individual files) and FFmpeg >>> doesn't enforce any specific version requirements. I guess it could be >>> worked around with some version checking ifdeffery if we know which >>> versions are affected. >> >> I don't see anything relevant in here: >> http://repo.or.cz/nasm.git/history/HEAD:/macros/smartalign.mac >> I also didn't notice anything similar in the changelog. > http://repo.or.cz/nasm.git/commit/f29123b936b1751c6d72dad86b05eb293cca5c6c > https://bugzilla.nasm.us/show_bug.cgi?id=3392319 This would break generation of dependencies? I think that FFmpeg uses combined mode with nasm, so it should not call preprocessor only mode, unless DBG=1 is used. I'll investigate this further. (revert the nasm fix, try building ffmpeg dependencies ...) >> I had it mixed before, but I changed it to be consistent. >> Some new avx instruction are not overloaded or >> available without their "v-" prefix. >> e.g. x86util uses vpbroadcastw in SPLATW. > > Every instruction that has both a legacy encoding and a VEX-encoding > will be automatically converted to VEX in AVX functions when written > without the v-prefix. There is no legacy encoding for newer > instructions so there's no reason for x86inc to overload those. That's great, but it doesn't address the original problem. Do you insist on removing the "v-" prefix from AVX only instructions? You said that this is purely cosmetic and I think that it would make compilation a bit slower, because of unnecessary macro expansions. ;) >> Isn't one of the selling points for x86inc that >> it would warn if unsupported instructions are used? > > Not really, no. It's done in a small subset of cases as a part of the > aforementioned auto legacy/VEX conversion because it was trivial to > add that functionality, but correctly tracking every single x86 > instruction in existence would be a whole different thing. > > Some instructions requires different instruction sets not just > depending on the vector width but the operand types as well (for > example, pextrw from mm to gpr requires mmx2, from xmm to gpr requires > sse2, from xmm to memory requires sse4.1) and there's no generic way > to handle every special corner case which would make things fairly > complex. I'm guessing doing so would also cause a significant > reduction in compilation speed which I'm sure some people would > dislike. There is already check for register size and integer operations, because SSE1 does not have 128bit xmm integer operations. It should be just few lines of code to add the same check for AVX1 and 256bit ymm integer operations. (I'm not confident enough to mess with that code, yet.) As for the immediate task: The "error" message stays. Maybe I should also add "use PBROADCASTD instead" to it? >>> Add your improvements to the existing x86util macro (can be done in >>> the same patch) and then use that. >> >> FFmpeg guidelines request that it is a separate patch in separate mail. > > Doesn't really seem to be followed historically by looking at the git > history, but doing things by the book is never wrong so making it a > separate commit might be the correct choice. Will do. >> Just renaming it is not enough, because SPLATD >> works differently (it can broadcast second, third or forth element too) >> and it is already used in existing code. >> Maybe PBROADCASTD macro could use SPLATD internally. > > SPLATD is hardcoded to broadcast the lowest element. > > http://git.videolan.org/?p=ffmpeg.git;a=blob;f=libavutil/x86/x86util.asm;h=cc7d272cada6d73e5ddc42ae359491086a206743;hb=HEAD#l782 > > Ideally the SSE branch of SPLATD should be also removed and any float > code using it should be changed to use the BROADCASTSS macro instead, > but that's perhaps going a bit off-topic.
Re: [FFmpeg-devel] [PATCH]v6 Opus Pyramid Vector Quantization Search in x86 SIMD asm
On Thu, Aug 3, 2017 at 11:36 PM, Ivan Kalvachevwrote: >> 1234_1234_1234_123 >> VBROADCASTSS ym1, xm1 >> BLENDVPS m1, m2, m3 >> >> is the most commonly used alignment. > > I see that a lot of .asm files use different alignments. > I'll try to pick something similar that I find good looking. > > I also see that different function/macro can have > different position for the first comma. > This could be quite useful, to visually separate > the macros. Things like indentation can be a bit inconsistent at times, yes. Cody by different authors written over the span of many year etc. It's normal to diverge from "standard alignment" in cases when macro names (or memory addresses) are fairly long. Otherwise the amount of whitespace between instructions and operands would easily get too large to be practical. >> Now that I think about it I believe that part wasn't merged because >> smartalign is broken on some older nasm versions (which is another >> reason for avoiding things like this in individual files) and FFmpeg >> doesn't enforce any specific version requirements. I guess it could be >> worked around with some version checking ifdeffery if we know which >> versions are affected. > > I don't see anything relevant in here: > http://repo.or.cz/nasm.git/history/HEAD:/macros/smartalign.mac > I also didn't notice anything similar in the changelog. http://repo.or.cz/nasm.git/commit/f29123b936b1751c6d72dad86b05eb293cca5c6c https://bugzilla.nasm.us/show_bug.cgi?id=3392319 > I had it mixed before, but I changed it to be consistent. > Some new avx instruction are not overloaded or > available without their "v-" prefix. > e.g. x86util uses vpbroadcastw in SPLATW. Every instruction that has both a legacy encoding and a VEX-encoding will be automatically converted to VEX in AVX functions when written without the v-prefix. There is no legacy encoding for newer instructions so there's no reason for x86inc to overload those. > Isn't one of the selling points for x86inc that > it would warn if unsupported instructions are used? Not really, no. It's done in a small subset of cases as a part of the aforementioned auto legacy/VEX conversion because it was trivial to add that functionality, but correctly tracking every single x86 instruction in existence would be a whole different thing. Some instructions requires different instruction sets not just depending on the vector width but the operand types as well (for example, pextrw from mm to gpr requires mmx2, from xmm to gpr requires sse2, from xmm to memory requires sse4.1) and there's no generic way to handle every special corner case which would make things fairly complex. I'm guessing doing so would also cause a significant reduction in compilation speed which I'm sure some people would dislike. >> Add your improvements to the existing x86util macro (can be done in >> the same patch) and then use that. > > FFmpeg guidelines request that it is a separate patch in separate mail. Doesn't really seem to be followed historically by looking at the git history, but doing things by the book is never wrong so making it a separate commit might be the correct choice. > Just renaming it is not enough, because SPLATD > works differently (it can broadcast second, third or forth element too) > and it is already used in existing code. > Maybe PBROADCASTD macro could use SPLATD internally. SPLATD is hardcoded to broadcast the lowest element. http://git.videolan.org/?p=ffmpeg.git;a=blob;f=libavutil/x86/x86util.asm;h=cc7d272cada6d73e5ddc42ae359491086a206743;hb=HEAD#l782 Ideally the SSE branch of SPLATD should be also removed and any float code using it should be changed to use the BROADCASTSS macro instead, but that's perhaps going a bit off-topic. > "cmpps" is not scalar, but it is handled by x86inc macros and converted > to "vcmpps", so no SSE penalties are possible with it. > > Read again what I wrote previously. 'cmpps' is packed, not scalar (hence the 'p' instead of 's') so it obviously shouldn't be used in scalar code. I believe you misread my initial comment which said to use 'cmpss' instead of 'cmpless'. >> One could of course argue whether or not x86inc should deal with >> psuedo-instructions but right now it doesn't so it will cause AVX >> state transitions if INIT_YMM is used since it wont be VEX-encoded. > > Let's say that using a number for selecting comparison mode > is NOT developer friendly. I did indeed say one can argue about it. I can see if there's any clean way of handling it that doesn't require a hundred lines of code. >> Doing PIC-mangling optimally will in many cases be context-dependent >> which is probably why it doesn't already exist. For example if you >> need to do multiple complex PIC-loads you should do a single lea and >> reuse it which is harder to convey as a generic macro. > > I think that this macro can help with that too: > e.g. > SET_PIC_BASE lea, r5, const_1 > movapsm1,
Re: [FFmpeg-devel] [PATCH]v6 Opus Pyramid Vector Quantization Search in x86 SIMD asm
On 8/2/17, Henrik Gramnerwrote: > On Tue, Aug 1, 2017 at 11:46 PM, Ivan Kalvachev > wrote: >> On 7/31/17, Henrik Gramner wrote: >>> Use rN instead of rNq for numbered registers (q suffix is used for >>> named args only due to preprocessor limitations). >> >> Is this documented? > > Not sure, but there's probably a bunch small things like this that > aren't really well documented. > >>> Use the same "standard" vertical alignment rules as most existing >>> code, e.g. instructions indented by 4 spaces and operands aligned on >>> the first comma. >> >> What do you mean operands aligned on the first comma? >> The longest operand I had is "xmm0" , counting comma and space >> I get 6 position alignment for operands. >> Now, with "xm7" i can lower this to 5 position. Should I do that? >> (I don't have "xm15" ). >> > > 1234_1234_1234_123 > VBROADCASTSS ym1, xm1 > BLENDVPS m1, m2, m3 > > is the most commonly used alignment. I see that a lot of .asm files use different alignments. I'll try to pick something similar that I find good looking. I also see that different function/macro can have different position for the first comma. This could be quite useful, to visually separate the macros. >>> Unless aligning every single loop entry helps a lot I'd avoid it since >>> it does waste a bit of icache. >> >> I'll redo my benchmarks, but I have placed these aligns for a reason. >> At some point removed debug instructions started making my code >> slower. So I've placed align to avoid random slowdowns. > > Ok. > >>> Explicitly using the carry flag as a branch condition is a bit weird. >>> Generally using jg/jl/jge/jle tends to be clearer. >> >> I use it to take advantage of the so called MacroFusion. >> It allows the merge of cmp and jxx, as long as the branch >> checks only one flag, so all signed branches are to be avoided >> (as stated by intel manual). >> The newer the cpu, the more opcodes (add/sub) >> could be merged and less limitations. > > Every µarch that can do macro-op fusion with add/sub can do so with > both signed and unsigned branch conditions. > There's actually only a single µarch that can fuse cmp with unsigned > but not signed branch conditions and that's more than a decade old > (Conroe), but if you want to check if (unsigned)a < (unsigned)b 'jb' > is preferred of 'jc' (both produce the exact same opcode). One is enough reason for me :} Since the same thing works just as well or better on newer CPU's. Also, using above/bellow on arithmetic operations is a lot more confusing than carry/borrow. You can see that I use "jc" and "jnc" after "sub". I'll keep it as it is, unless you have solid reason to change it. >>> Assembler-specific ifdeffery should be avoided in individual files. >>> Something equivalent already exists in x86inc actually but I don't >>> remember if it was merged to FFmpeg from upstream (x264) yet. >> >> Definitely not merged. >> >> I hear a lot about the improved x86inc, >> maybe it is time for you to merge it in FFmpeg? > > Now that I think about it I believe that part wasn't merged because > smartalign is broken on some older nasm versions (which is another > reason for avoiding things like this in individual files) and FFmpeg > doesn't enforce any specific version requirements. I guess it could be > worked around with some version checking ifdeffery if we know which > versions are affected. I don't see anything relevant in here: http://repo.or.cz/nasm.git/history/HEAD:/macros/smartalign.mac I also didn't notice anything similar in the changelog. I don't think this is relevant reason for not merging avx2 support. ;) >>> Isn't this just an overly complicated way of expressing "dd 0*4, 1*4, >>> 2*4, 3*4, 4*4, 5*4, 6*4, 7*4"? >> >> Yes. >> Do you know a way that works with "times 8"? >> I've done it this way to make it easy to change the size of the constant. >> >> Do you request I change it? > > I'd prefer just using that simple one-liner I posted. Replacing the > numbers if you want to change things later is trivial. It might be trivial, but it also makes it easier to misspell something. I'd prefer to keep it as it is. >>> Is reordering moves for the non-AVX path worth the additional >>> complexity? Microoptimizations that only affect legacy hardware are >>> IMO a bit questionable. >> >> Yes, I'm on "legacy" hardware and the improvement is reliably measurable. > > Ok, in that case you should also use shufps/addps instead of > vshufps/vaddps in the AVX branch for cosmetic reasons though (x86inc > performs automatic VEX-encoding). Only on instructions it knows... that brings us again to the AVX2. I had it mixed before, but I changed it to be consistent. Some new avx instruction are not overloaded or available without their "v-" prefix. e.g. x86util uses vpbroadcastw in SPLATW. It makes more sense to use "v" prefix for code that is guaranteed to be avx. >>> The AVX1 integer warning is
Re: [FFmpeg-devel] [PATCH]v6 Opus Pyramid Vector Quantization Search in x86 SIMD asm
On Tue, Aug 1, 2017 at 11:46 PM, Ivan Kalvachevwrote: > On 7/31/17, Henrik Gramner wrote: >> Use rN instead of rNq for numbered registers (q suffix is used for >> named args only due to preprocessor limitations). > > Is this documented? Not sure, but there's probably a bunch small things like this that aren't really well documented. >> Use the same "standard" vertical alignment rules as most existing >> code, e.g. instructions indented by 4 spaces and operands aligned on >> the first comma. > > What do you mean operands aligned on the first comma? > The longest operand I had is "xmm0" , counting comma and space > I get 6 position alignment for operands. > Now, with "xm7" i can lower this to 5 position. Should I do that? > (I don't have "xm15" ). > 1234_1234_1234_123 VBROADCASTSS ym1, xm1 BLENDVPS m1, m2, m3 is the most commonly used alignment. >> Unless aligning every single loop entry helps a lot I'd avoid it since >> it does waste a bit of icache. > > I'll redo my benchmarks, but I have placed these aligns for a reason. > At some point removed debug instructions started making my code > slower. So I've placed align to avoid random slowdowns. Ok. >> Explicitly using the carry flag as a branch condition is a bit weird. >> Generally using jg/jl/jge/jle tends to be clearer. > > I use it to take advantage of the so called MacroFusion. > It allows the merge of cmp and jxx, as long as the branch > checks only one flag, so all signed branches are to be avoided > (as stated by intel manual). > The newer the cpu, the more opcodes (add/sub) > could be merged and less limitations. Every µarch that can do macro-op fusion with add/sub can do so with both signed and unsigned branch conditions. There's actually only a single µarch that can fuse cmp with unsigned but not signed branch conditions and that's more than a decade old (Conroe), but if you want to check if (unsigned)a < (unsigned)b 'jb' is preferred of 'jc' (both produce the exact same opcode). >> Assembler-specific ifdeffery should be avoided in individual files. >> Something equivalent already exists in x86inc actually but I don't >> remember if it was merged to FFmpeg from upstream (x264) yet. > > Definitely not merged. > > I hear a lot about the improved x86inc, > maybe it is time for you to merge it in FFmpeg? Now that I think about it I believe that part wasn't merged because smartalign is broken on some older nasm versions (which is another reason for avoiding things like this in individual files) and FFmpeg doesn't enforce any specific version requirements. I guess it could be worked around with some version checking ifdeffery if we know which versions are affected. >> Isn't this just an overly complicated way of expressing "dd 0*4, 1*4, >> 2*4, 3*4, 4*4, 5*4, 6*4, 7*4"? > > Yes. > Do you know a way that works with "times 8"? > I've done it this way to make it easy to change the size of the constant. > > Do you request I change it? I'd prefer just using that simple one-liner I posted. Replacing the numbers if you want to change things later is trivial. >> Is reordering moves for the non-AVX path worth the additional >> complexity? Microoptimizations that only affect legacy hardware are >> IMO a bit questionable. > > Yes, I'm on "legacy" hardware and the improvement is reliably measurable. Ok, in that case you should also use shufps/addps instead of vshufps/vaddps in the AVX branch for cosmetic reasons though (x86inc performs automatic VEX-encoding). >> The AVX1 integer warning is kind of silly and doesn't really serve any >> purpose. > > As I've said in this thread before, > If you use avx1 ymm with this macro > you won't get any warning that avx2 opcode has been generated > because x86inc does not overload avx2. Sure, but using any other AVX2 instruction in AVX functions wont result in any warnings either so it doesn't make much sense to do it in this specific case. >> Use the existing x86util VBROADCASTSS macro. Modify it if it's lacking >> something. > > The x86util VBROADCASTSS macro implements only > the avx1 variant that is memory to register. > My variant emulates the avx2 variant, > that also does reg to reg. > > My variant also avoids write dependency on "movss reg, reg". > ("movss reg, mem" clears the other elements, but > "movss reg, reg" preserves them. This creates dependency on the old > values of the register.) > > And yes, I did ask if I should separate > these macros in another file and > if I should try to merge them in x86util.asm first. > > What do you think on the matter? Add your improvements to the existing x86util macro (can be done in the same patch) and then use that. >> +%macro EMU_pbroadcastd 2 ; dst, src >> +%if cpuflag(avx2) >> + vpbroadcastd %1, %2 >> +%elif cpuflag(avx) && mmsize >= 32 >> +%error AVX1 does not support integer 256 bit ymm operations >> +%else >> + %ifnum sizeof%2 ; sse2 register >> +
Re: [FFmpeg-devel] [PATCH]v6 Opus Pyramid Vector Quantization Search in x86 SIMD asm
On 7/31/17, Henrik Gramnerwrote: > On Wed, Jul 26, 2017 at 4:56 PM, Ivan Kalvachev > wrote: >> +++ b/libavcodec/x86/opus_pvq_search.asm > > Generic minor stuff: > > Use rN instead of rNq for numbered registers (q suffix is used for > named args only due to preprocessor limitations). Done. Is this documented? > Use the same "standard" vertical alignment rules as most existing > code, e.g. instructions indented by 4 spaces and operands aligned on > the first comma. What do you mean operands aligned on the first comma? The longest operand I had is "xmm0" , counting comma and space I get 6 position alignment for operands. Now, with "xm7" i can lower this to 5 position. Should I do that? (I don't have "xm15" ). Also, I've aligned the operand at 12 positions after their instruction start, because this is the size of the longest real instruction. As for why i have instruction start at 8'th position. It's because I've allocated the field at position 4-7 for instruction prefixes, and the "EMU_" is 4 positions. Now I have: 1234567812345_12345_12345_ EMU_broadcastss ym13, xm10 EMU_blendvpsxm1, xm2, m3 vblendvps blendvps I can ditch the prefix and shorten the operands. e.g. : 1234_1234_1234_1234_ VBROADCASTSS ym1, xm1 BLENDVPS m1, m2, m3 Is that acceptable? Or maybe you do mean: 1234_1234_1234_123 VBROADCASTSS ym1, xm1 BLENDVPS m1, m2, m3 However I would prefer to use 1234_1234_1234_123 VBROADCASTSS ym1, xm1 BLENDVPS m1, m2, m3 PBLENDVD xm1, xm2, xm3 (You do need fixed width font to see that correctly). I'll wait for reply before doing that. > Use xmN instead of xmmN (only really makes a difference when SWAP:s > are used, but might as well do it "correctly"). Done. > Use 32-bit operands, e.g. rNd, when 64-bit math isn't required. Done > Unless aligning every single loop entry helps a lot I'd avoid it since > it does waste a bit of icache. I'll redo my benchmarks, but I have placed these aligns for a reason. At some point removed debug instructions started making my code slower. So I've placed align to avoid random slowdowns. > Explicitly using the carry flag as a branch condition is a bit weird. > Generally using jg/jl/jge/jle tends to be clearer. I use it to take advantage of the so called MacroFusion. It allows the merge of cmp and jxx, as long as the branch checks only one flag, so all signed branches are to be avoided (as stated by intel manual). The newer the cpu, the more opcodes (add/sub) could be merged and less limitations. >> +%ifdef __NASM_VER__ >> +%use "smartalign" >> +ALIGNMODE p6 >> +%endif > > Assembler-specific ifdeffery should be avoided in individual files. > Something equivalent already exists in x86inc actually but I don't > remember if it was merged to FFmpeg from upstream (x264) yet. Definitely not merged. I hear a lot about the improved x86inc, maybe it is time for you to merge it in FFmpeg? >> +const_int32_offsets: >> +%rep 8 >> +dd $-const_int32_offsets >> +%endrep > > Isn't this just an overly complicated way of expressing "dd 0*4, 1*4, > 2*4, 3*4, 4*4, 5*4, 6*4, 7*4"? Yes. Do you know a way that works with "times 8"? I've done it this way to make it easy to change the size of the constant. Do you request I change it? >> +SECTION .text >> + >> + >> + >> + > > Reduce some of the excessive whitespace. Will do with the other cosmetics. >> +%macro HSUMPS 2 ; dst/src, tmp >> +%if cpuflag(avx) >> + %if sizeof%1>=32 ; avx >> + vperm2f128 %2, %1, %1, (0)*16+(1) ; >> %2=lo(%1)<<128+hi(%1) >> + vaddps %1, %2 >> + %endif >> + vshufps %2, %1, %1, q1032 >> + vaddps %1, %2 >> + vshufps %2, %1, %1, q0321 >> + vaddps %1, %2 >> + >> +%else ; this form is a bit faster than the short avx-like emulation. >> +movaps %2, %1;[d, c, b, >> a ] >> +shufps %1, %1, q1032 ; %2=[b, a, d, >> c ] >> +addps %1, %2; %1=[b+d, a+c, d+b, >> c+a ] >> +movaps %2, %1 >> +shufps %1, %1, q0321 ; %2=[c+a, b+d, a+c, >> d+b ] >> +addps %1, %2; %1=[c+a+b+d, b+d+a+c, a+c+d+b, >> d+b+c+a ] >> +; all %1 members should be equal for as long as float a+b==b+a >> +%endif >> +%endmacro > > Is reordering moves for the non-AVX path worth the additional > complexity? Microoptimizations that only affect legacy hardware are > IMO a bit questionable. Yes, I'm on "legacy" hardware and the improvement is reliably measurable. >> +%macro EMU_pblendvb 3 ; dst/src_a, src_b, mask >> +%if cpuflag(avx) >> + %if cpuflag(avx) && notcpuflag(avx2) && sizeof%1 >= 32 >> +%error
Re: [FFmpeg-devel] [PATCH]v6 Opus Pyramid Vector Quantization Search in x86 SIMD asm
On Wed, Jul 26, 2017 at 4:56 PM, Ivan Kalvachevwrote: > +++ b/libavcodec/x86/opus_pvq_search.asm Generic minor stuff: Use rN instead of rNq for numbered registers (q suffix is used for named args only due to preprocessor limitations). Use the same "standard" vertical alignment rules as most existing code, e.g. instructions indented by 4 spaces and operands aligned on the first comma. Use xmN instead of xmmN (only really makes a difference when SWAP:s are used, but might as well do it "correctly"). Use 32-bit operands, e.g. rNd, when 64-bit math isn't required. Unless aligning every single loop entry helps a lot I'd avoid it since it does waste a bit of icache. Explicitly using the carry flag as a branch condition is a bit weird. Generally using jg/jl/jge/jle tends to be clearer. > +%ifdef __NASM_VER__ > +%use "smartalign" > +ALIGNMODE p6 > +%endif Assembler-specific ifdeffery should be avoided in individual files. Something equivalent already exists in x86inc actually but I don't remember if it was merged to FFmpeg from upstream (x264) yet. > +const_int32_offsets: > +%rep 8 > +dd $-const_int32_offsets > +%endrep Isn't this just an overly complicated way of expressing "dd 0*4, 1*4, 2*4, 3*4, 4*4, 5*4, 6*4, 7*4"? > +SECTION .text > + > + > + > + Reduce some of the excessive whitespace. > +%macro HSUMPS 2 ; dst/src, tmp > +%if cpuflag(avx) > + %if sizeof%1>=32 ; avx > + vperm2f128 %2, %1, %1, (0)*16+(1) ; %2=lo(%1)<<128+hi(%1) > + vaddps %1, %2 > + %endif > + vshufps %2, %1, %1, q1032 > + vaddps %1, %2 > + vshufps %2, %1, %1, q0321 > + vaddps %1, %2 > + > +%else ; this form is a bit faster than the short avx-like emulation. > +movaps %2, %1;[d, c, b, a > ] > +shufps %1, %1, q1032 ; %2=[b, a, d, c > ] > +addps %1, %2; %1=[b+d, a+c, d+b, c+a > ] > +movaps %2, %1 > +shufps %1, %1, q0321 ; %2=[c+a, b+d, a+c, d+b > ] > +addps %1, %2; %1=[c+a+b+d, b+d+a+c, a+c+d+b, > d+b+c+a ] > +; all %1 members should be equal for as long as float a+b==b+a > +%endif > +%endmacro Is reordering moves for the non-AVX path worth the additional complexity? Microoptimizations that only affect legacy hardware are IMO a bit questionable. > +%macro EMU_pblendvb 3 ; dst/src_a, src_b, mask > +%if cpuflag(avx) > + %if cpuflag(avx) && notcpuflag(avx2) && sizeof%1 >= 32 > +%error AVX1 does not support integer 256 bit ymm operations > + %endif > + > + vpblendvb%1, %1, %2, %3 > +;- > +%elif cpuflag(sse4) > +%ifnidn %3,xmm0 > + %error sse41 blendvps uses xmm0 as default 3 operand, you used %3 > +%endif > +pblendvb%1, %2, %3 > +;- > +%else > +pxor%2, %1 > +pand%2, %3 > +pxor%1, %2 > +%endif > +%endmacro The AVX1 integer warning is kind of silly and doesn't really serve any purpose. > +%macro EMU_broadcastss 2 ; dst, src > +%if cpuflag(avx2) > + vbroadcastss %1, %2; ymm, xmm > +;- > +%elif cpuflag(avx) > + %ifnum sizeof%2 ; avx1 register > + vpermilpsxmm%1, xmm%2, q ; xmm, xmm, imm || ymm, ymm, imm > +%if sizeof%1 >= 32 ; mmsize>=32 > + vinsertf128 %1, %1,xmm%1, 1 ; ymm, ymm, xmm, im > +%endif > + %else ; avx1 memory > + vbroadcastss %1, %2; ymm, mm32 || xmm, mm32 > + %endif > +;- > +%else > + %ifnum sizeof%2 ; sse register > +shufps %1, %2, %2, q > + %else ; sse memory > +movss %1, %2; this zeroes the other 3 elements > +shufps %1, %1, 0 > + %endif > +%endif > +%endmacro Use the existing x86util VBROADCASTSS macro. Modify it if it's lacking something. +%macro EMU_pbroadcastd 2 ; dst, src +%if cpuflag(avx2) + vpbroadcastd %1, %2 +%elif cpuflag(avx) && mmsize >= 32 +%error AVX1 does not support integer 256 bit ymm operations +%else + %ifnum sizeof%2 ; sse2 register +pshufd %1, %2, q + %else ; sse memory +movd%1, %2; movd zero extends +pshufd %1, %1, 0 + %endif +%endif +%endmacro Same as above, but using the SPLATD macro. > +; Merge parallel maximums final round (1 vs 1) > +movaps xmm0, xmm3 ; m0 = m3[0] = p[0] > +shufps xmm3, xmm3, q ; m3 = m3[1] = p[1] > +cmpless xmm0, xmm3 Use 3-arg instead of reg-reg movaps. Use the actual cmpss instruction instead
Re: [FFmpeg-devel] [PATCH]v6 Opus Pyramid Vector Quantization Search in x86 SIMD asm
On 7/27/17, Rostislav Pehlivanovwrote: > On 27 July 2017 at 09:38, Ivan Kalvachev wrote: > >> On 7/27/17, Rostislav Pehlivanov wrote: >> > On 26 July 2017 at 15:56, Ivan Kalvachev wrote: >> > >> >> +if (ARCH_X86 && CONFIG_OPUS_ENCODER) >> >> +ff_opus_dsp_init_x86(s); >> >> >> > >> > Just change it to >> > +if (ARCH_X86) >> > >> > The init function is named opus_dsp, so it'll get used to other opus >> > things, not just the encoder. >> >> But at the moment it does not. >> I do prefer to leave that task for the one that >> adds opus decoder functions. >> >> Also this change alone would break compilation, since >> it also requires changing the libavcodec/x86/Makefile >> and adding the guard inside the opus_dsp_init.c >> >> Another option is to have "opus_enc_dsp_init.c" and call >> the function "ff_opus_enc_dsp_init_x86()". >> >> Do tell me which option do you prefer >> and do you insist on v7 just for that. >> >> > The assembly code looks fine to me, but other people will have to take a >> > look at it in case I'm missing something. > > The former, but that can be changed later after pushing Here is the patch. I'll merge it in v7, if there is one. Please note that makefile needs to use two separate config_opus_decoder/encoder, since there is no config_opus_codec . All other dsp seem to use separate files for encoder and decoder dsp. Best Regards. From 1a6ee9b2880c67db25737a6317f09cbbac441c83 Mon Sep 17 00:00:00 2001 From: Ivan Kalvachev Date: Thu, 27 Jul 2017 14:21:33 +0300 Subject: [PATCH 2/6] Build opus dsp for encoder and decoder. --- libavcodec/opus_pvq.c | 2 +- libavcodec/x86/Makefile| 1 + libavcodec/x86/opus_dsp_init.c | 2 ++ 3 files changed, 4 insertions(+), 1 deletion(-) diff --git a/libavcodec/opus_pvq.c b/libavcodec/opus_pvq.c index 3aa502929c..2fb276099b 100644 --- a/libavcodec/opus_pvq.c +++ b/libavcodec/opus_pvq.c @@ -947,7 +947,7 @@ int av_cold ff_celt_pvq_init(CeltPVQ **pvq) s->encode_band= pvq_encode_band; s->band_cost = pvq_band_cost; -if (ARCH_X86 && CONFIG_OPUS_ENCODER) +if (ARCH_X86) ff_opus_dsp_init_x86(s); *pvq = s; diff --git a/libavcodec/x86/Makefile b/libavcodec/x86/Makefile index 9875f48797..e36644c72a 100644 --- a/libavcodec/x86/Makefile +++ b/libavcodec/x86/Makefile @@ -52,6 +52,7 @@ OBJS-$(CONFIG_APNG_DECODER)+= x86/pngdsp_init.o OBJS-$(CONFIG_CAVS_DECODER)+= x86/cavsdsp.o OBJS-$(CONFIG_DCA_DECODER) += x86/dcadsp_init.o x86/synth_filter_init.o OBJS-$(CONFIG_DNXHD_ENCODER) += x86/dnxhdenc_init.o +OBJS-$(CONFIG_OPUS_DECODER)+= x86/opus_dsp_init.o OBJS-$(CONFIG_OPUS_ENCODER)+= x86/opus_dsp_init.o OBJS-$(CONFIG_HEVC_DECODER)+= x86/hevcdsp_init.o OBJS-$(CONFIG_JPEG2000_DECODER)+= x86/jpeg2000dsp_init.o diff --git a/libavcodec/x86/opus_dsp_init.c b/libavcodec/x86/opus_dsp_init.c index f4c25822db..c51f786ee8 100644 --- a/libavcodec/x86/opus_dsp_init.c +++ b/libavcodec/x86/opus_dsp_init.c @@ -32,6 +32,7 @@ av_cold void ff_opus_dsp_init_x86(CeltPVQ *s) { int cpu_flags = av_get_cpu_flags(); +#if CONFIG_OPUS_ENCODER if (EXTERNAL_SSE2(cpu_flags)) s->pvq_search = ff_pvq_search_sse2; @@ -40,4 +41,5 @@ av_cold void ff_opus_dsp_init_x86(CeltPVQ *s) if (EXTERNAL_AVX(cpu_flags)) s->pvq_search = ff_pvq_search_avx; +#endif } -- 2.13.2 ___ ffmpeg-devel mailing list ffmpeg-devel@ffmpeg.org http://ffmpeg.org/mailman/listinfo/ffmpeg-devel
Re: [FFmpeg-devel] [PATCH]v6 Opus Pyramid Vector Quantization Search in x86 SIMD asm
On 27 July 2017 at 09:38, Ivan Kalvachevwrote: > On 7/27/17, Rostislav Pehlivanov wrote: > > On 26 July 2017 at 15:56, Ivan Kalvachev wrote: > > > >> +if (ARCH_X86 && CONFIG_OPUS_ENCODER) > >> +ff_opus_dsp_init_x86(s); > >> > > > > Just change it to > > +if (ARCH_X86) > > > > The init function is named opus_dsp, so it'll get used to other opus > > things, not just the encoder. > > But at the moment it does not. > I do prefer to leave that task for the one that > adds opus decoder functions. > > Also this change alone would break compilation, since > it also requires changing the libavcodec/x86/Makefile > and adding the guard inside the opus_dsp_init.c > > Another option is to have "opus_enc_dsp_init.c" and call > the function "ff_opus_enc_dsp_init_x86()". > > Do tell me which option do you prefer > and do you insist on v7 just for that. > > > The assembly code looks fine to me, but other people will have to take a > > look at it in case I'm missing something. > ___ > ffmpeg-devel mailing list > ffmpeg-devel@ffmpeg.org > http://ffmpeg.org/mailman/listinfo/ffmpeg-devel > The former, but that can be changed later after pushing ___ ffmpeg-devel mailing list ffmpeg-devel@ffmpeg.org http://ffmpeg.org/mailman/listinfo/ffmpeg-devel
Re: [FFmpeg-devel] [PATCH]v6 Opus Pyramid Vector Quantization Search in x86 SIMD asm
On 7/27/17, Rostislav Pehlivanovwrote: > On 26 July 2017 at 15:56, Ivan Kalvachev wrote: > >> +if (ARCH_X86 && CONFIG_OPUS_ENCODER) >> +ff_opus_dsp_init_x86(s); >> > > Just change it to > +if (ARCH_X86) > > The init function is named opus_dsp, so it'll get used to other opus > things, not just the encoder. But at the moment it does not. I do prefer to leave that task for the one that adds opus decoder functions. Also this change alone would break compilation, since it also requires changing the libavcodec/x86/Makefile and adding the guard inside the opus_dsp_init.c Another option is to have "opus_enc_dsp_init.c" and call the function "ff_opus_enc_dsp_init_x86()". Do tell me which option do you prefer and do you insist on v7 just for that. > The assembly code looks fine to me, but other people will have to take a > look at it in case I'm missing something. ___ ffmpeg-devel mailing list ffmpeg-devel@ffmpeg.org http://ffmpeg.org/mailman/listinfo/ffmpeg-devel
Re: [FFmpeg-devel] [PATCH]v6 Opus Pyramid Vector Quantization Search in x86 SIMD asm
On 26 July 2017 at 15:56, Ivan Kalvachevwrote: > +if (ARCH_X86 && CONFIG_OPUS_ENCODER) > +ff_opus_dsp_init_x86(s); > Just change it to +if (ARCH_X86) The init function is named opus_dsp, so it'll get used to other opus things, not just the encoder. The assembly code looks fine to me, but other people will have to take a look at it in case I'm missing something. ___ ffmpeg-devel mailing list ffmpeg-devel@ffmpeg.org http://ffmpeg.org/mailman/listinfo/ffmpeg-devel