On 2017-11-27 17:50, Henrik Gramner wrote:
> On Sun, Nov 26, 2017 at 11:51 PM, James Darnley wrote:
>> -pd_0_int_min: times 2 dd 0, -2147483648
>> -pq_int_min: times 2 dq -2147483648
>> -pq_int_max: times 2 dq 2147483647
>> +pd_0_int_min: times 4 dd 0, -2147483648
>> +pq_int_min: tim
>> Using 128-bit broadcasts is preferable over duplicating the constants
>> to 256-bit unless there's a good reason for doing so since it wastes
>> less cache and is faster on AMD CPUs.
>
> What would that reason be? Afaik broadcasts are expensive, since they
> both load from memory then splat dat
On 11/27/2017 1:50 PM, Henrik Gramner wrote:
> On Sun, Nov 26, 2017 at 11:51 PM, James Darnley wrote:
>> -pd_0_int_min: times 2 dd 0, -2147483648
>> -pq_int_min: times 2 dq -2147483648
>> -pq_int_max: times 2 dq 2147483647
>> +pd_0_int_min: times 4 dd 0, -2147483648
>> +pq_int_min: t
On Sun, Nov 26, 2017 at 11:51 PM, James Darnley wrote:
> -pd_0_int_min: times 2 dd 0, -2147483648
> -pq_int_min: times 2 dq -2147483648
> -pq_int_max: times 2 dq 2147483647
> +pd_0_int_min: times 4 dd 0, -2147483648
> +pq_int_min: times 4 dq -2147483648
> +pq_int_max: times 4 dq 21
On 11/26/2017 8:13 PM, Rostislav Pehlivanov wrote:
> On 26 November 2017 at 22:51, James Darnley wrote:
>
>> When compared to the SSE4.2 version, runtime is reduced by 1 to 26%. The
>> function itself is around 2 times faster.
>> ---
>> libavcodec/x86/flac_dsp_gpl.asm | 56 +
On 11/26/2017 7:51 PM, James Darnley wrote:
> When compared to the SSE4.2 version, runtime is reduced by 1 to 26%. The
> function itself is around 2 times faster.
> ---
> libavcodec/x86/flac_dsp_gpl.asm | 56 +++--
> libavcodec/x86/flacdsp_init.c | 5 +++-
On 2017-11-27 00:13, Rostislav Pehlivanov wrote:
> On 26 November 2017 at 22:51, James Darnley wrote:
>> @@ -123,7 +123,10 @@ RET
>> %endmacro
>>
>> %macro PMINSQ 3
>> -pcmpgtq %3, %2, %1
>> +mova %3, %2
>> +; We cannot use the 3-operand format because the memory location
>> canno
On 26 November 2017 at 22:51, James Darnley wrote:
> When compared to the SSE4.2 version, runtime is reduced by 1 to 26%. The
> function itself is around 2 times faster.
> ---
> libavcodec/x86/flac_dsp_gpl.asm | 56 +++--
> libavcodec/x86/flacdsp_init.c
When compared to the SSE4.2 version, runtime is reduced by 1 to 26%. The
function itself is around 2 times faster.
---
libavcodec/x86/flac_dsp_gpl.asm | 56 +++--
libavcodec/x86/flacdsp_init.c | 5 +++-
2 files changed, 47 insertions(+), 14 deletions(-)
dif