Re: [FFmpeg-devel] [PATCH 3/3] avcodec/scpr: optimize shift loop.

2017-09-09 Thread Michael Niedermayer
On Sat, Sep 09, 2017 at 03:11:00PM +0100, Derek Buitenhuis wrote:
> On 9/8/2017 11:15 PM, James Almer wrote:
> > It reads eight bytes at a time if the buffer is sufficiently aligned,
> > then finishes reading the remaining bytes one at a time.
> > If the buffer is unaligned, it reads everything one byte at a time like
> > it used to.
> > 
> > See ff_h2645_extract_rbsp() and add_bytes_c() for another example of
> > this optimization.
> 
> So put a comment, or at least put it in the commit message.
> 
> It isn't exactly straightforward; it's like reading Hex Rays output.

ill repost the patch with comments

thx

[...]
-- 
Michael GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB

It is dangerous to be right in matters on which the established authorities
are wrong. -- Voltaire


signature.asc
Description: Digital signature
___
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
http://ffmpeg.org/mailman/listinfo/ffmpeg-devel


Re: [FFmpeg-devel] [PATCH 3/3] avcodec/scpr: optimize shift loop.

2017-09-09 Thread Derek Buitenhuis
On 9/8/2017 11:15 PM, James Almer wrote:
> It reads eight bytes at a time if the buffer is sufficiently aligned,
> then finishes reading the remaining bytes one at a time.
> If the buffer is unaligned, it reads everything one byte at a time like
> it used to.
> 
> See ff_h2645_extract_rbsp() and add_bytes_c() for another example of
> this optimization.

So put a comment, or at least put it in the commit message.

It isn't exactly straightforward; it's like reading Hex Rays output.

- Derek
___
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
http://ffmpeg.org/mailman/listinfo/ffmpeg-devel


Re: [FFmpeg-devel] [PATCH 3/3] avcodec/scpr: optimize shift loop.

2017-09-08 Thread James Almer
On 9/8/2017 8:49 PM, Michael Niedermayer wrote:
> On Fri, Sep 08, 2017 at 06:43:06PM -0300, James Almer wrote:
>> On 9/8/2017 6:29 PM, Michael Niedermayer wrote:
>>> Speeds code up from 50sec to 15sec
>>>
>>> Fixes Timeout
>>> Fixes: 3242/clusterfuzz-testcase-5811951672229888
>>>
>>> Found-by: continuous fuzzing process 
>>> https://github.com/google/oss-fuzz/tree/master/projects/ffmpeg
>>> Signed-off-by: Michael Niedermayer 
>>> ---
>>>  libavcodec/scpr.c | 11 ++-
>>>  1 file changed, 10 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/libavcodec/scpr.c b/libavcodec/scpr.c
>>> index 37fbe7a106..2ef63a7bf8 100644
>>> --- a/libavcodec/scpr.c
>>> +++ b/libavcodec/scpr.c
>>> @@ -827,7 +827,16 @@ static int decode_frame(AVCodecContext *avctx, void 
>>> *data, int *got_frame,
>>>  return ret;
>>>  
>>>  for (y = 0; y < avctx->height; y++) {
>>> -for (x = 0; x < avctx->width * 4; x++) {
>>> +if (!(((uintptr_t)dst) & 7)) {
>>> +uint64_t *dst64 = (uint64_t *)dst;
>>> +int w = avctx->width>>1;
>>> +for (x = 0; x < w; x++) {
>>> +dst64[x] = (dst64[x] << 3) & 0xFCFCFCFCFCFCFCFCULL;
>>
>> Shouldn't this be used only if HAVE_FAST_64BIT is true, and a version
>> shifting four bytes at a time used otherwise? That's how we do almost
>> everywhere else.
> 
> The compiler would break the 64bit into two 32bit automatically.
> I can write an explicit version if that is wanted ?
> it seemed overkill for this here though

I guess it's simple enough that the compiler can figure that out, yeah.
Should be ok then.

> 
> 
>>
>> The chances for anyone bothering writing simd for this decoder are
>> almost none, so adding C optimized loops is ok in this case.
>>
>>> +}
>>> +x *= 8;
>>> +} else
>>> +x = 0;
>>
>> How does this fix the timeout if the new code is only run if the pointer
>> is eight byte aligned? (or four once you add that).
> 
> The timeout is IIRC currently defined as 30sec or so on the platform
> the fuzzer runs on, data is aligned in that case.
> 
> You could imagine a platform where data is not aligned, yes
> on that patform the patch would not improve the time decoding takes.
> Simiarly you could imagine a CPU that supports only 8bit operations,
> or just a slower CPU.
> 
> This patch adds some optimizations that makes decoding faster which
> gets us over the threshold for this particular sample and a normal
> modern desktop platform.
> 
> Its quite possible another scpr file will still hit the threshold
> and certainly another threshold or other hw could trigger it
> still with this sample
> 
> I would very much prefer a more universal fix ...
> But i didnt see one and making that loop 3 times as fast is great on
> its own.

Ok. As i said nobody is going to write simd for this, so adding C
optimized code is fine.
___
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
http://ffmpeg.org/mailman/listinfo/ffmpeg-devel


Re: [FFmpeg-devel] [PATCH 3/3] avcodec/scpr: optimize shift loop.

2017-09-08 Thread Michael Niedermayer
On Fri, Sep 08, 2017 at 06:43:06PM -0300, James Almer wrote:
> On 9/8/2017 6:29 PM, Michael Niedermayer wrote:
> > Speeds code up from 50sec to 15sec
> > 
> > Fixes Timeout
> > Fixes: 3242/clusterfuzz-testcase-5811951672229888
> > 
> > Found-by: continuous fuzzing process 
> > https://github.com/google/oss-fuzz/tree/master/projects/ffmpeg
> > Signed-off-by: Michael Niedermayer 
> > ---
> >  libavcodec/scpr.c | 11 ++-
> >  1 file changed, 10 insertions(+), 1 deletion(-)
> > 
> > diff --git a/libavcodec/scpr.c b/libavcodec/scpr.c
> > index 37fbe7a106..2ef63a7bf8 100644
> > --- a/libavcodec/scpr.c
> > +++ b/libavcodec/scpr.c
> > @@ -827,7 +827,16 @@ static int decode_frame(AVCodecContext *avctx, void 
> > *data, int *got_frame,
> >  return ret;
> >  
> >  for (y = 0; y < avctx->height; y++) {
> > -for (x = 0; x < avctx->width * 4; x++) {
> > +if (!(((uintptr_t)dst) & 7)) {
> > +uint64_t *dst64 = (uint64_t *)dst;
> > +int w = avctx->width>>1;
> > +for (x = 0; x < w; x++) {
> > +dst64[x] = (dst64[x] << 3) & 0xFCFCFCFCFCFCFCFCULL;
> 
> Shouldn't this be used only if HAVE_FAST_64BIT is true, and a version
> shifting four bytes at a time used otherwise? That's how we do almost
> everywhere else.

The compiler would break the 64bit into two 32bit automatically.
I can write an explicit version if that is wanted ?
it seemed overkill for this here though


> 
> The chances for anyone bothering writing simd for this decoder are
> almost none, so adding C optimized loops is ok in this case.
> 
> > +}
> > +x *= 8;
> > +} else
> > +x = 0;
> 
> How does this fix the timeout if the new code is only run if the pointer
> is eight byte aligned? (or four once you add that).

The timeout is IIRC currently defined as 30sec or so on the platform
the fuzzer runs on, data is aligned in that case.

You could imagine a platform where data is not aligned, yes
on that patform the patch would not improve the time decoding takes.
Simiarly you could imagine a CPU that supports only 8bit operations,
or just a slower CPU.

This patch adds some optimizations that makes decoding faster which
gets us over the threshold for this particular sample and a normal
modern desktop platform.

Its quite possible another scpr file will still hit the threshold
and certainly another threshold or other hw could trigger it
still with this sample

I would very much prefer a more universal fix ...
But i didnt see one and making that loop 3 times as fast is great on
its own.

[...]
-- 
Michael GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB

Asymptotically faster algorithms should always be preferred if you have
asymptotical amounts of data


signature.asc
Description: Digital signature
___
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
http://ffmpeg.org/mailman/listinfo/ffmpeg-devel


Re: [FFmpeg-devel] [PATCH 3/3] avcodec/scpr: optimize shift loop.

2017-09-08 Thread James Almer
On 9/8/2017 6:47 PM, Kieran Kunhya wrote:
> On Fri, 8 Sep 2017 at 22:29 Michael Niedermayer 
> wrote:
> 
>> Speeds code up from 50sec to 15sec
>>
>> Fixes Timeout
>> Fixes: 3242/clusterfuzz-testcase-5811951672229888
>>
>> Found-by: continuous fuzzing process
>> https://github.com/google/oss-fuzz/tree/master/projects/ffmpeg
>> Signed-off-by
>> :
>> Michael Niedermayer 
>> ---
>>  libavcodec/scpr.c | 11 ++-
>>  1 file changed, 10 insertions(+), 1 deletion(-)
>>
>> diff --git a/libavcodec/scpr.c b/libavcodec/scpr.c
>> index 37fbe7a106..2ef63a7bf8 100644
>> --- a/libavcodec/scpr.c
>> +++ b/libavcodec/scpr.c
>> @@ -827,7 +827,16 @@ static int decode_frame(AVCodecContext *avctx, void
>> *data, int *got_frame,
>>  return ret;
>>
>>  for (y = 0; y < avctx->height; y++) {
>> -for (x = 0; x < avctx->width * 4; x++) {
>> +if (!(((uintptr_t)dst) & 7)) {
>> +uint64_t *dst64 = (uint64_t *)dst;
>> +int w = avctx->width>>1;
>> +for (x = 0; x < w; x++) {
>> +dst64[x] = (dst64[x] << 3) & 0xFCFCFCFCFCFCFCFCULL;
>> +}
>> +x *= 8;
>> +} else
>> +x = 0;
>> +for (; x < avctx->width * 4; x++) {
>>  dst[x] = dst[x] << 3;
>>  }
>>  dst += frame->linesize[0];
>> --
>> 2.14.1
>>
> 
> This is as clear as mud.

It reads eight bytes at a time if the buffer is sufficiently aligned,
then finishes reading the remaining bytes one at a time.
If the buffer is unaligned, it reads everything one byte at a time like
it used to.

See ff_h2645_extract_rbsp() and add_bytes_c() for another example of
this optimization.
___
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
http://ffmpeg.org/mailman/listinfo/ffmpeg-devel


Re: [FFmpeg-devel] [PATCH 3/3] avcodec/scpr: optimize shift loop.

2017-09-08 Thread Kieran Kunhya
On Fri, 8 Sep 2017 at 22:29 Michael Niedermayer 
wrote:

> Speeds code up from 50sec to 15sec
>
> Fixes Timeout
> Fixes: 3242/clusterfuzz-testcase-5811951672229888
>
> Found-by: continuous fuzzing process
> https://github.com/google/oss-fuzz/tree/master/projects/ffmpeg
> Signed-off-by
> :
> Michael Niedermayer 
> ---
>  libavcodec/scpr.c | 11 ++-
>  1 file changed, 10 insertions(+), 1 deletion(-)
>
> diff --git a/libavcodec/scpr.c b/libavcodec/scpr.c
> index 37fbe7a106..2ef63a7bf8 100644
> --- a/libavcodec/scpr.c
> +++ b/libavcodec/scpr.c
> @@ -827,7 +827,16 @@ static int decode_frame(AVCodecContext *avctx, void
> *data, int *got_frame,
>  return ret;
>
>  for (y = 0; y < avctx->height; y++) {
> -for (x = 0; x < avctx->width * 4; x++) {
> +if (!(((uintptr_t)dst) & 7)) {
> +uint64_t *dst64 = (uint64_t *)dst;
> +int w = avctx->width>>1;
> +for (x = 0; x < w; x++) {
> +dst64[x] = (dst64[x] << 3) & 0xFCFCFCFCFCFCFCFCULL;
> +}
> +x *= 8;
> +} else
> +x = 0;
> +for (; x < avctx->width * 4; x++) {
>  dst[x] = dst[x] << 3;
>  }
>  dst += frame->linesize[0];
> --
> 2.14.1
>

This is as clear as mud.
___
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
http://ffmpeg.org/mailman/listinfo/ffmpeg-devel


Re: [FFmpeg-devel] [PATCH 3/3] avcodec/scpr: optimize shift loop.

2017-09-08 Thread James Almer
On 9/8/2017 6:29 PM, Michael Niedermayer wrote:
> Speeds code up from 50sec to 15sec
> 
> Fixes Timeout
> Fixes: 3242/clusterfuzz-testcase-5811951672229888
> 
> Found-by: continuous fuzzing process 
> https://github.com/google/oss-fuzz/tree/master/projects/ffmpeg
> Signed-off-by: Michael Niedermayer 
> ---
>  libavcodec/scpr.c | 11 ++-
>  1 file changed, 10 insertions(+), 1 deletion(-)
> 
> diff --git a/libavcodec/scpr.c b/libavcodec/scpr.c
> index 37fbe7a106..2ef63a7bf8 100644
> --- a/libavcodec/scpr.c
> +++ b/libavcodec/scpr.c
> @@ -827,7 +827,16 @@ static int decode_frame(AVCodecContext *avctx, void 
> *data, int *got_frame,
>  return ret;
>  
>  for (y = 0; y < avctx->height; y++) {
> -for (x = 0; x < avctx->width * 4; x++) {
> +if (!(((uintptr_t)dst) & 7)) {
> +uint64_t *dst64 = (uint64_t *)dst;
> +int w = avctx->width>>1;
> +for (x = 0; x < w; x++) {
> +dst64[x] = (dst64[x] << 3) & 0xFCFCFCFCFCFCFCFCULL;

Shouldn't this be used only if HAVE_FAST_64BIT is true, and a version
shifting four bytes at a time used otherwise? That's how we do almost
everywhere else.

The chances for anyone bothering writing simd for this decoder are
almost none, so adding C optimized loops is ok in this case.

> +}
> +x *= 8;
> +} else
> +x = 0;

How does this fix the timeout if the new code is only run if the pointer
is eight byte aligned? (or four once you add that).

> +for (; x < avctx->width * 4; x++) {
>  dst[x] = dst[x] << 3;
>  }
>  dst += frame->linesize[0];
> 

___
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
http://ffmpeg.org/mailman/listinfo/ffmpeg-devel


[FFmpeg-devel] [PATCH 3/3] avcodec/scpr: optimize shift loop.

2017-09-08 Thread Michael Niedermayer
Speeds code up from 50sec to 15sec

Fixes Timeout
Fixes: 3242/clusterfuzz-testcase-5811951672229888

Found-by: continuous fuzzing process 
https://github.com/google/oss-fuzz/tree/master/projects/ffmpeg
Signed-off-by: Michael Niedermayer 
---
 libavcodec/scpr.c | 11 ++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/libavcodec/scpr.c b/libavcodec/scpr.c
index 37fbe7a106..2ef63a7bf8 100644
--- a/libavcodec/scpr.c
+++ b/libavcodec/scpr.c
@@ -827,7 +827,16 @@ static int decode_frame(AVCodecContext *avctx, void *data, 
int *got_frame,
 return ret;
 
 for (y = 0; y < avctx->height; y++) {
-for (x = 0; x < avctx->width * 4; x++) {
+if (!(((uintptr_t)dst) & 7)) {
+uint64_t *dst64 = (uint64_t *)dst;
+int w = avctx->width>>1;
+for (x = 0; x < w; x++) {
+dst64[x] = (dst64[x] << 3) & 0xFCFCFCFCFCFCFCFCULL;
+}
+x *= 8;
+} else
+x = 0;
+for (; x < avctx->width * 4; x++) {
 dst[x] = dst[x] << 3;
 }
 dst += frame->linesize[0];
-- 
2.14.1

___
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
http://ffmpeg.org/mailman/listinfo/ffmpeg-devel