[FFmpeg-devel] [PATCH v3] libavcodec/vp8dec: fix the multi-thread HWAccel decode error

2019-06-11 Thread Shaofei Wang
Fix the issue: https://github.com/intel/media-driver/issues/317

The root cause: update_dimensions() is called multiple times when the
decoder thread count is greater than 1, and each decode thread calling
get_pixel_format() from update_dimensions() triggers
hwaccel_uninit/hwaccel_init more than once, although only one hwaccel
context should be created and shared by all decode threads.
In the current code,
there are 3 situations in update_dimensions():
1. First call. Whether single-threaded or multi-threaded,
   get_pixel_format() should be called after the dimensions are
   set;
2. Dimensions changed at runtime. The dimensions need to be
   updated when macroblocks_base is already allocated, and
   get_pixel_format() should be called to recreate the frames
   according to the updated dimensions;
3. First call from the other decode threads. After decoder init,
   the other threads call update_dimensions() for the first time
   to allocate macroblocks_base and set the dimensions, but
   get_pixel_format() should not be called because the low-level
   frames and context have already been created.

In this fix, get_pixel_format() is only called from update_dimensions()
when it is actually needed.
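
As an illustration only (a hypothetical standalone helper, not the actual
libavcodec/vp8.c code), the gating that results from this patch can be
summarized as follows; pix_fmt_unset, dims_changed and already_allocated
stand in for s->pix_fmt == AV_PIX_FMT_NONE, the width/height/macroblock
comparison and s->macroblocks_base != NULL respectively:

    #include <stdio.h>

    /* returns non-zero when get_pixel_format() would be (re)queried */
    static int should_query_pix_fmt(int pix_fmt_unset, int dims_changed,
                                    int already_allocated)
    {
        /* case 1: first call, pix_fmt not chosen yet
         * case 2: runtime dimension change with macroblocks_base allocated */
        return pix_fmt_unset || (dims_changed && already_allocated);
    }

    int main(void)
    {
        printf("case 1 (first call):              %d\n", should_query_pix_fmt(1, 1, 0));
        printf("case 2 (runtime dim change):      %d\n", should_query_pix_fmt(0, 1, 1));
        printf("case 3 (other threads, 1st call): %d\n", should_query_pix_fmt(0, 0, 0));
        return 0;
    }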

Signed-off-by: Wang, Shaofei 
Reviewed-by: Jun, Zhao 
Reviewed-by: Haihao Xiang 
---
Updated typo in the commit message

 libavcodec/vp8.c | 7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/libavcodec/vp8.c b/libavcodec/vp8.c
index ba79e5f..0a7f38b 100644
--- a/libavcodec/vp8.c
+++ b/libavcodec/vp8.c
@@ -187,7 +187,7 @@ static av_always_inline
 int update_dimensions(VP8Context *s, int width, int height, int is_vp7)
 {
     AVCodecContext *avctx = s->avctx;
-    int i, ret;
+    int i, ret, dim_reset = 0;
 
     if (width  != s->avctx->width || ((width+15)/16 != s->mb_width || (height+15)/16 != s->mb_height) && s->macroblocks_base ||
         height != s->avctx->height) {
@@ -196,9 +196,12 @@ int update_dimensions(VP8Context *s, int width, int height, int is_vp7)
         ret = ff_set_dimensions(s->avctx, width, height);
         if (ret < 0)
             return ret;
+
+        dim_reset = (s->macroblocks_base != NULL);
     }
 
-    if (!s->actually_webp && !is_vp7) {
+    if ((s->pix_fmt == AV_PIX_FMT_NONE || dim_reset) &&
+        !s->actually_webp && !is_vp7) {
         s->pix_fmt = get_pixel_format(s);
         if (s->pix_fmt < 0)
             return AVERROR(EINVAL);
-- 
1.8.3.1

[FFmpeg-devel] [PATCH v2] libavcodec/vp8dec: fix the multi-thread HWAccel decode error

2019-03-27 Thread Shaofei Wang
Fix the issue: https://github.com/intel/media-driver/issues/317

The root cause: update_dimensions() is called multiple times when the
decoder thread count is greater than 1, and each decode thread calling
get_pixel_format() from update_dimensions() triggers
hwaccel_uninit/hwaccel_init more than once, although only one hwaccel
context should be created and shared by all decode threads.
In the current code,
there are 3 situations in update_dimensions():
1. First call. Whether single-threaded or multi-threaded,
   get_pixel_format() should be called after the dimensions are
   set;
2. Dimensions changed at runtime. The dimensions need to be
   updated when macroblocks_base is already allocated, and
   get_pixel_format() should be called to recreate the frames
   according to the updated dimensions;
3. First call from the other decode threads. After decoder init,
   the other threads call update_dimensions() for the first time
   to allocate macroblocks_base and set the dimensions, but
   get_pixel_format() should not be called because the low-level
   frames and context have already been created.

In this fix, get_pixel_format() is only called from update_dimensions()
when it is actually needed.

Signed-off-by: Wang, Shaofei 
Reviewed-by: Jun, Zhao 
Reviewed-by: Haihao Xiang 
---
Previous code reviews:
2019-03-06 9:25 GMT+01:00, Wang, Shaofei :
>> -Original Message-
>> From: ffmpeg-devel [mailto:ffmpeg-devel-boun...@ffmpeg.org] On Behalf 
>> Of Carl Eugen Hoyos
>> Sent: Wednesday, March 6, 2019 3:49 PM
>> To: FFmpeg development discussions and patches 
>> 
>> Subject: Re: [FFmpeg-devel] [PATCH] libavcodec/vp8dec: fix the 
>> multi-thread HWAccel decode error
>>
>> 2018-08-09 9:09 GMT+02:00, Jun Zhao :
>> > the root cause is update_dimentions call get_pixel_format will 
>> > trigger the hwaccel_uninit/hwaccel_init , in current context, there 
>> > are 3 situations in the update_dimentions():
>> > 1. First time calling. No matter single thread or multithread,
>> >get_pixel_format() should be called after dimentions were
>> >set;
>> > 2. Dimention changed at the runtime. Dimention need to be
>> >updated when macroblocks_base is already allocated,
>> >get_pixel_format() should be called to recreate new frames
>> >according to updated dimention;
>> > 3. Multithread first time calling. After decoder init, the
>> >other threads will call update_dimentions() at first time
>> >to allocate macroblocks_base and set dimentions.
>> >But get_pixel_format() is shouldn't be called due to low
>> >level frames and context are already created.
>> > In this fix, we only call update_dimentions as need.
>> >
>> > Signed-off-by: Wang, Shaofei 
>> > Reviewed-by: Jun, Zhao 
>> > ---
>> >  libavcodec/vp8.c |7 +--
>> >  1 files changed, 5 insertions(+), 2 deletions(-)
>> >
>> > diff --git a/libavcodec/vp8.c b/libavcodec/vp8.c index 
>> > 3adfeac..18d1ada 100644
>> > --- a/libavcodec/vp8.c
>> > +++ b/libavcodec/vp8.c
>> > @@ -187,7 +187,7 @@ static av_always_inline  int 
>> > update_dimensions(VP8Context *s, int width, int height, int is_vp7)  {
>> >  AVCodecContext *avctx = s->avctx;
>> > -int i, ret;
>> > +int i, ret, dim_reset = 0;
>> >
>> >  if (width  != s->avctx->width || ((width+15)/16 != s->mb_width 
>> > ||
>> > (height+15)/16 != s->mb_height) && s->macroblocks_base ||
>> >  height != s->avctx->height) { @@ -196,9 +196,12 @@ int 
>> > update_dimensions(VP8Context *s, int width, int height, int is_vp7)
>> >  ret = ff_set_dimensions(s->avctx, width, height);
>> >  if (ret < 0)
>> >  return ret;
>> > +
>> > +dim_reset = (s->macroblocks_base != NULL);
>> >  }
>> >
>> > -if (!s->actually_webp && !is_vp7) {
>> > +if ((s->pix_fmt == AV_PIX_FMT_NONE || dim_reset) &&
>> > + !s->actually_webp && !is_vp7) {
>>
>> Why is the new variable dim_reset needed?
>> Wouldn't the patch be simpler if you used s->macroblocks_base here?
> Since dim_reset is set inside that "if" block, it is equal to
> (width != s->avctx->width || ((width+15)/16 != s->mb_width ||
> (height+15)/16 != s->mb_height) || height != s->avctx->height) &&
> s->macroblocks_base

Thank you!

Carl Eugen
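
Following up on the question above, a tiny standalone sketch (hypothetical,
not vp8.c code) that walks the dim_reset logic over a few calls; it shows
that dim_reset fires only on an actual runtime dimension change, while
s->macroblocks_base is non-NULL on every call after the first allocation:

    #include <stdio.h>

    int main(void)
    {
        int already_allocated = 0;              /* stands in for s->macroblocks_base != NULL */
        int dims_changed[4]   = { 1, 0, 0, 1 }; /* call 1 sets dims, call 4 changes them */

        for (int call = 0; call < 4; call++) {
            int dim_reset = 0;
            if (dims_changed[call])
                dim_reset = already_allocated;  /* dims changed && already allocated */

            printf("call %d: already_allocated=%d dims_changed=%d -> dim_reset=%d\n",
                   call + 1, already_allocated, dims_changed[call], dim_reset);

            if (dims_changed[call])
                already_allocated = 1;          /* macroblocks_base gets (re)allocated */
        }
        return 0;
    }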


 libavcodec/vp8.c | 7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/libavcodec/vp8.c b/libavcodec/vp8.c
index ba79e5f..0a7f38b 100644
--- a/libavcodec/vp8.c
+++ b/libavcodec/vp8.c
@@ -187,7 +187,7 @@ static av_always_inline
 int update_dimensions(VP8Context *s, int width, int height, int is_vp7)
 {
     AVCodecContext *avctx = s->avctx;
-    int i, ret;
+    int i, ret, dim_reset = 0;
 
     if (width  != s->avctx->width || ((width+15)/16 != s->mb_width || (height+15)/16 != s->mb_height) && s->macroblocks_base ||
         height != s->avctx->height) {
@@ -196,9 +196,12 @@ int update_dimensions(VP8Context *s, int width, int height, int is_vp7)
         ret = ff_set_dimensions(s->avctx, width, height);
         if (ret < 0)
             return ret;
+
+        dim_reset = (s->macroblocks_base != NULL);
     }
 
-if 

[FFmpeg-devel] [PATCH] Improved the performance of 1 decode + N filter graphs and adaptive bitrate.

2019-03-26 Thread Shaofei Wang
It enables MULTIPLE SIMPLE filter graph concurrency, which brings about a
4%~20% improvement in some 1:N scenarios with CPU or GPU acceleration
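
For reference, here is a minimal, self-contained sketch of the per-input-filter
worker handoff this change introduces (illustrative only: Worker, submit_frame
and worker_main are hypothetical names, not the fftools/ffmpeg.c symbols; the
real code hands AVFrames to ifilter_send_frame()/reap_filters() and adds a
separate finish handshake plus error propagation):

    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>

    typedef struct Worker {
        pthread_t       thread;
        pthread_mutex_t lock;
        pthread_cond_t  cond;
        int            *pending;   /* "frame" waiting to be filtered, NULL if none */
        int             eof;       /* no more frames will arrive */
    } Worker;

    static void *worker_main(void *arg)
    {
        Worker *w = arg;
        for (;;) {
            pthread_mutex_lock(&w->lock);
            while (!w->pending && !w->eof)
                pthread_cond_wait(&w->cond, &w->lock);    /* wait for work or EOF */
            if (!w->pending && w->eof) {
                pthread_mutex_unlock(&w->lock);
                break;
            }
            int *frame = w->pending;
            w->pending = NULL;
            pthread_cond_signal(&w->cond);                /* tell producer the slot is free */
            pthread_mutex_unlock(&w->lock);

            /* the real worker would run ifilter_send_frame() + reap_filters() here */
            printf("worker %p filtered frame %d\n", (void *)w, *frame);
            free(frame);
        }
        return NULL;
    }

    static void submit_frame(Worker *w, int *frame)
    {
        pthread_mutex_lock(&w->lock);
        while (w->pending)                                /* previous frame not taken yet */
            pthread_cond_wait(&w->cond, &w->lock);
        w->pending = frame;
        pthread_cond_signal(&w->cond);
        pthread_mutex_unlock(&w->lock);
    }

    int main(void)
    {
        Worker w[2];                                      /* e.g. two simple filter graphs */
        for (int i = 0; i < 2; i++) {
            w[i] = (Worker){ .pending = NULL, .eof = 0 };
            pthread_mutex_init(&w[i].lock, NULL);
            pthread_cond_init(&w[i].cond, NULL);
            pthread_create(&w[i].thread, NULL, worker_main, &w[i]);
        }
        for (int n = 0; n < 4; n++)                       /* one decoded frame feeds all graphs */
            for (int i = 0; i < 2; i++) {
                int *frame = malloc(sizeof(*frame));
                *frame = n;
                submit_frame(&w[i], frame);
            }
        for (int i = 0; i < 2; i++) {
            pthread_mutex_lock(&w[i].lock);
            w[i].eof = 1;
            pthread_cond_signal(&w[i].cond);
            pthread_mutex_unlock(&w[i].lock);
            pthread_join(w[i].thread, NULL);
            pthread_mutex_destroy(&w[i].lock);
            pthread_cond_destroy(&w[i].cond);
        }
        return 0;
    }

Each simple filter graph gets its own worker, so after the decoder produces a
frame the main loop can hand it to every graph and let them filter in parallel.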

Below are some test cases and comparison as reference.
(Hardware platform: Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz)
(Software: Intel iHD driver - 16.9.00100, CentOS 7)

For 1:N transcode by GPU acceleration with vaapi:
./ffmpeg -vaapi_device /dev/dri/renderD128 -hwaccel vaapi \
-hwaccel_output_format vaapi \
-i ~/Videos/1920x1080p_30.00_x264_qp28.h264 \
-vf "scale_vaapi=1280:720" -c:v h264_vaapi -f null /dev/null \
-vf "scale_vaapi=720:480" -c:v h264_vaapi -f null /dev/null

test results:
            2 encoders   5 encoders   10 encoders
Improved    6.1%         6.9%         5.5%

For 1:N transcode by GPU acceleration with QSV:
./ffmpeg -hwaccel qsv -c:v h264_qsv \
-i ~/Videos/1920x1080p_30.00_x264_qp28.h264 \
-vf "scale_qsv=1280:720:format=nv12" -c:v h264_qsv -f null /dev/null \
-vf "scale_qsv=720:480:format=nv12" -c:v h264_qsv -f null /dev/null

test results:
2 encoders  5 encoders 10 encoders
Improved   6%   4% 15%

For Intel GPU acceleration case, 1 decode to N scaling, by QSV:
./ffmpeg -hwaccel qsv -c:v h264_qsv \
-i ~/Videos/1920x1080p_30.00_x264_qp28.h264 \
-vf "scale_qsv=1280:720:format=nv12,hwdownload" -pix_fmt nv12 -f null 
/dev/null \
-vf "scale_qsv=720:480:format=nv12,hwdownload" -pix_fmt nv12 -f null 
/dev/null

test results:
            2 scale   5 scale   10 scale
Improved    12%       21%       21%

For CPU only 1 decode to N scaling:
./ffmpeg -i ~/Videos/1920x1080p_30.00_x264_qp28.h264 \
-vf "scale=1280:720" -pix_fmt nv12 -f null /dev/null \
-vf "scale=720:480" -pix_fmt nv12 -f null /dev/null

test results:
            2 scale   5 scale   10 scale
Improved    25%       107%      148%

Signed-off-by: Wang, Shaofei 
---
The patch only affects pipelines with multiple SIMPLE filter graphs.
Passed FATE and refined the possible data races.
AFL tested, without introducing extra crashes/hangs.

 fftools/ffmpeg.c | 172 +--
 fftools/ffmpeg.h |  13 +
 2 files changed, 169 insertions(+), 16 deletions(-)

diff --git a/fftools/ffmpeg.c b/fftools/ffmpeg.c
index 544f1a1..5f6e712 100644
--- a/fftools/ffmpeg.c
+++ b/fftools/ffmpeg.c
@@ -164,7 +164,13 @@ static struct termios oldtty;
 static int restore_tty;
 #endif
 
+/* enable abr threads when there are multiple simple filter graphs */
+static int abr_threads_enabled = 0;
+
 #if HAVE_THREADS
+pthread_mutex_t fg_config_mutex;
+pthread_mutex_t ost_init_mutex;
+
 static void free_input_threads(void);
 #endif
 
@@ -509,6 +515,17 @@ static void ffmpeg_cleanup(int ret)
                 }
                 av_fifo_freep(&fg->inputs[j]->ist->sub2video.sub_queue);
             }
+#if HAVE_THREADS
+            if (abr_threads_enabled) {
+                av_frame_free(&fg->inputs[j]->input_frm);
+                pthread_mutex_lock(&fg->inputs[j]->process_mutex);
+                fg->inputs[j]->waited_frm = NULL;
+                fg->inputs[j]->t_end = 1;
+                pthread_cond_signal(&fg->inputs[j]->process_cond);
+                pthread_mutex_unlock(&fg->inputs[j]->process_mutex);
+                pthread_join(fg->inputs[j]->abr_thread, NULL);
+            }
+#endif
             av_buffer_unref(&fg->inputs[j]->hw_frames_ctx);
             av_freep(&fg->inputs[j]->name);
             av_freep(&fg->inputs[j]);
@@ -1419,12 +1436,13 @@ static void finish_output_stream(OutputStream *ost)
  *
  * @return  0 for success, <0 for severe errors
  */
-static int reap_filters(int flush)
+static int reap_filters(int flush, InputFilter * ifilter)
 {
     AVFrame *filtered_frame = NULL;
     int i;
 
-    /* Reap all buffers present in the buffer sinks */
+    /* Reap all buffers present in the buffer sinks, or only those of the
+     * filter graph that has ifilter as one of its inputs */
     for (i = 0; i < nb_output_streams; i++) {
         OutputStream *ost = output_streams[i];
         OutputFile    *of = output_files[ost->file_index];
@@ -1432,13 +1450,25 @@ static int reap_filters(int flush)
         AVCodecContext *enc = ost->enc_ctx;
         int ret = 0;
 
+        if (ifilter && abr_threads_enabled)
+            if (ost != ifilter->graph->outputs[0]->ost)
+                continue;
+
         if (!ost->filter || !ost->filter->graph->graph)
             continue;
         filter = ost->filter->filter;
 
         if (!ost->initialized) {
             char error[1024] = "";
+#if HAVE_THREADS
+            if (abr_threads_enabled)
+                pthread_mutex_lock(&ost_init_mutex);
+#endif
             ret = init_output_stream(ost, error, sizeof(error));
+#if HAVE_THREADS
+            if (abr_threads_enabled)
+                pthread_mutex_unlock(&ost_init_mutex);
+#endif
             if (ret < 0) {
 av_log(NULL, AV_LOG_ERROR, "Error initializing output stream 

[FFmpeg-devel] [PATCH v7] Improved the performance of 1 decode + N filter graphs and adaptive bitrate.

2019-03-20 Thread Shaofei Wang
It enables MULTIPLE SIMPLE filter graph concurrency, which brings about a
4%~20% improvement in some 1:N scenarios with CPU or GPU acceleration

Below are some test cases and comparison as reference.
(Hardware platform: Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz)
(Software: Intel iHD driver - 16.9.00100, CentOS 7)

For 1:N transcode by GPU acceleration with vaapi:
./ffmpeg -vaapi_device /dev/dri/renderD128 -hwaccel vaapi \
-hwaccel_output_format vaapi \
-i ~/Videos/1920x1080p_30.00_x264_qp28.h264 \
-vf "scale_vaapi=1280:720" -c:v h264_vaapi -f null /dev/null \
-vf "scale_vaapi=720:480" -c:v h264_vaapi -f null /dev/null

test results:
            2 encoders   5 encoders   10 encoders
Improved    6.1%         6.9%         5.5%

For 1:N transcode by GPU acceleration with QSV:
./ffmpeg -hwaccel qsv -c:v h264_qsv \
-i ~/Videos/1920x1080p_30.00_x264_qp28.h264 \
-vf "scale_qsv=1280:720:format=nv12" -c:v h264_qsv -f null /dev/null \
-vf "scale_qsv=720:480:format=nv12" -c:v h264_qsv -f null /dev/null

test results:
2 encoders  5 encoders 10 encoders
Improved   6%   4% 15%

For Intel GPU acceleration case, 1 decode to N scaling, by QSV:
./ffmpeg -hwaccel qsv -c:v h264_qsv \
-i ~/Videos/1920x1080p_30.00_x264_qp28.h264 \
-vf "scale_qsv=1280:720:format=nv12,hwdownload" -pix_fmt nv12 -f null 
/dev/null \
-vf "scale_qsv=720:480:format=nv12,hwdownload" -pix_fmt nv12 -f null 
/dev/null

test results:
            2 scale   5 scale   10 scale
Improved    12%       21%       21%

For CPU only 1 decode to N scaling:
./ffmpeg -i ~/Videos/1920x1080p_30.00_x264_qp28.h264 \
-vf "scale=1280:720" -pix_fmt nv12 -f null /dev/null \
-vf "scale=720:480" -pix_fmt nv12 -f null /dev/null

test results:
            2 scale   5 scale   10 scale
Improved    25%       107%      148%

Signed-off-by: Wang, Shaofei 
Reviewed-by: Michael Niedermayer 
Reviewed-by: Mark Thompson 
---
The patch only affects pipelines with multiple SIMPLE filter graphs.
Passed FATE and refined the possible data races.
AFL tested, without introducing extra crashes/hangs:

  american fuzzy lop 2.52b (ffmpeg_g) run summary:

  run time: 0 days, 9 hrs, 48 min    cycles done: 0       total paths: 1866
  uniq crashes: 0                    uniq hangs: 35
  map density: 24.91% / 36.60%       count coverage: 2.40 bits/tuple
  total execs: 123k                  exec speed: 3.50/sec
  stability: 76.69%

 fftools/ffmpeg.c | 172 +--
 fftools/ffmpeg.h |  13 +
 2 files changed, 169 insertions(+), 16 deletions(-)

diff --git a/fftools/ffmpeg.c b/fftools/ffmpeg.c
index 544f1a1..59a953a 100644
--- a/fftools/ffmpeg.c
+++ b/fftools/ffmpeg.c
@@ -164,7 +164,13 @@ static struct termios oldtty;
 static int restore_tty;
 #endif
 
+/* enable abr threads when there are multiple simple filter graphs */
+static int abr_threads_enabled = 0;
+
 #if HAVE_THREADS
+pthread_mutex_t fg_config_mutex;
+pthread_mutex_t ost_init_mutex;
+
 static void free_input_threads(void);
 #endif
 
@@ -509,6 +515,17 @@ static void ffmpeg_cleanup(int ret)
                 }
                 av_fifo_freep(&fg->inputs[j]->ist->sub2video.sub_queue);
             }
+#if HAVE_THREADS
+            if (abr_threads_enabled) {
+

[FFmpeg-devel] [PATCH v6] Improved the performance of 1 decode + N filter graphs and adaptive bitrate.

2019-03-13 Thread Shaofei Wang
It enables multiple simple filter graph concurrency, which brings about a
4%~20% improvement in some 1:N scenarios with CPU or GPU acceleration

Below are some test cases and comparison as reference.
(Hardware platform: Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz)
(Software: Intel iHD driver - 16.9.00100, CentOS 7)

For 1:N transcode by GPU acceleration with vaapi:
./ffmpeg -vaapi_device /dev/dri/renderD128 -hwaccel vaapi \
-hwaccel_output_format vaapi \
-i ~/Videos/1920x1080p_30.00_x264_qp28.h264 \
-vf "scale_vaapi=1280:720" -c:v h264_vaapi -f null /dev/null \
-vf "scale_vaapi=720:480" -c:v h264_vaapi -f null /dev/null

test results:
            2 encoders   5 encoders   10 encoders
Improved    6.1%         6.9%         5.5%

For 1:N transcode by GPU acceleration with QSV:
./ffmpeg -hwaccel qsv -c:v h264_qsv \
-i ~/Videos/1920x1080p_30.00_x264_qp28.h264 \
-vf "scale_qsv=1280:720:format=nv12" -c:v h264_qsv -f null /dev/null \
-vf "scale_qsv=720:480:format=nv12" -c:v h264_qsv -f null /dev/null

test results:
2 encoders  5 encoders 10 encoders
Improved   6%   4% 15%

For Intel GPU acceleration case, 1 decode to N scaling, by QSV:
./ffmpeg -hwaccel qsv -c:v h264_qsv \
-i ~/Videos/1920x1080p_30.00_x264_qp28.h264 \
-vf "scale_qsv=1280:720:format=nv12,hwdownload" -pix_fmt nv12 -f null 
/dev/null \
-vf "scale_qsv=720:480:format=nv12,hwdownload" -pix_fmt nv12 -f null 
/dev/null

test results:
            2 scale   5 scale   10 scale
Improved    12%       21%       21%

For CPU only 1 decode to N scaling:
./ffmpeg -i ~/Videos/1920x1080p_30.00_x264_qp28.h264 \
-vf "scale=1280:720" -pix_fmt nv12 -f null /dev/null \
-vf "scale=720:480" -pix_fmt nv12 -f null /dev/null

test results:
            2 scale   5 scale   10 scale
Improved    25%       107%      148%

Signed-off-by: Wang, Shaofei 
---
Passed FATE and refined the possible data races.
The patch only affects pipelines with multiple SIMPLE filter graphs.

 fftools/ffmpeg.c | 172 +--
 fftools/ffmpeg.h |  13 +
 2 files changed, 169 insertions(+), 16 deletions(-)

diff --git a/fftools/ffmpeg.c b/fftools/ffmpeg.c
index 544f1a1..c0c9ca8 100644
--- a/fftools/ffmpeg.c
+++ b/fftools/ffmpeg.c
@@ -164,7 +164,13 @@ static struct termios oldtty;
 static int restore_tty;
 #endif
 
+/* enable abr threads when there are multiple simple filter graphs */
+static int abr_threads_enabled = 0;
+
 #if HAVE_THREADS
+pthread_mutex_t fg_config_mutex;
+pthread_mutex_t ost_init_mutex;
+
 static void free_input_threads(void);
 #endif
 
@@ -509,6 +515,17 @@ static void ffmpeg_cleanup(int ret)
                 }
                 av_fifo_freep(&fg->inputs[j]->ist->sub2video.sub_queue);
             }
+#if HAVE_THREADS
+            if (abr_threads_enabled) {
+                av_frame_free(&fg->inputs[j]->input_frm);
+                pthread_mutex_lock(&fg->inputs[j]->process_mutex);
+                fg->inputs[j]->waited_frm = NULL;
+                fg->inputs[j]->t_end = 1;
+                pthread_cond_signal(&fg->inputs[j]->process_cond);
+                pthread_mutex_unlock(&fg->inputs[j]->process_mutex);
+                pthread_join(fg->inputs[j]->abr_thread, NULL);
+            }
+#endif
             av_buffer_unref(&fg->inputs[j]->hw_frames_ctx);
             av_freep(&fg->inputs[j]->name);
             av_freep(&fg->inputs[j]);
@@ -1419,12 +1436,13 @@ static void finish_output_stream(OutputStream *ost)
  *
  * @return  0 for success, <0 for severe errors
  */
-static int reap_filters(int flush)
+static int reap_filters(int flush, InputFilter * ifilter)
 {
     AVFrame *filtered_frame = NULL;
     int i;
 
-    /* Reap all buffers present in the buffer sinks */
+    /* Reap all buffers present in the buffer sinks, or only those of the
+     * filter graph that has ifilter as one of its inputs */
     for (i = 0; i < nb_output_streams; i++) {
         OutputStream *ost = output_streams[i];
         OutputFile    *of = output_files[ost->file_index];
@@ -1432,13 +1450,25 @@ static int reap_filters(int flush)
         AVCodecContext *enc = ost->enc_ctx;
         int ret = 0;
 
+        if (ifilter && abr_threads_enabled)
+            if (ost != ifilter->graph->outputs[0])
+                continue;
+
         if (!ost->filter || !ost->filter->graph->graph)
             continue;
         filter = ost->filter->filter;
 
         if (!ost->initialized) {
             char error[1024] = "";
+#if HAVE_THREADS
+            if (abr_threads_enabled)
+                pthread_mutex_lock(&ost_init_mutex);
+#endif
             ret = init_output_stream(ost, error, sizeof(error));
+#if HAVE_THREADS
+            if (abr_threads_enabled)
+                pthread_mutex_unlock(&ost_init_mutex);
+#endif
             if (ret < 0) {
 av_log(NULL, AV_LOG_ERROR, "Error initializing output stream 
%d:%d -- %s\n",
ost->file_index, 

[FFmpeg-devel] [PATCH v5] Improved the performance of 1 decode + N filter graphs and adaptive bitrate.

2019-02-15 Thread Shaofei Wang
It enables multiple filter graph concurrency, which brings about a
4%~20% improvement in some 1:N scenarios with CPU or GPU acceleration

Below are some test cases and comparison as reference.
(Hardware platform: Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz)
(Software: Intel iHD driver - 16.9.00100, CentOS 7)

For 1:N transcode by GPU acceleration with vaapi:
./ffmpeg -vaapi_device /dev/dri/renderD128 -hwaccel vaapi \
-hwaccel_output_format vaapi \
-i ~/Videos/1920x1080p_30.00_x264_qp28.h264 \
-vf "scale_vaapi=1280:720" -c:v h264_vaapi -f null /dev/null \
-vf "scale_vaapi=720:480" -c:v h264_vaapi -f null /dev/null

test results:
            2 encoders   5 encoders   10 encoders
Improved    6.1%         6.9%         5.5%

For 1:N transcode by GPU acceleration with QSV:
./ffmpeg -hwaccel qsv -c:v h264_qsv \
-i ~/Videos/1920x1080p_30.00_x264_qp28.h264 \
-vf "scale_qsv=1280:720:format=nv12" -c:v h264_qsv -f null /dev/null \
-vf "scale_qsv=720:480:format=nv12" -c:v h264_qsv -f null /dev/null

test results:
2 encoders  5 encoders 10 encoders
Improved   6%   4% 15%

For Intel GPU acceleration case, 1 decode to N scaling, by QSV:
./ffmpeg -hwaccel qsv -c:v h264_qsv \
-i ~/Videos/1920x1080p_30.00_x264_qp28.h264 \
-vf "scale_qsv=1280:720:format=nv12,hwdownload" -pix_fmt nv12 -f null 
/dev/null \
-vf "scale_qsv=720:480:format=nv12,hwdownload" -pix_fmt nv12 -f null 
/dev/null

test results:
            2 scale   5 scale   10 scale
Improved    12%       21%       21%

For CPU only 1 decode to N scaling:
./ffmpeg -i ~/Videos/1920x1080p_30.00_x264_qp28.h264 \
-vf "scale=1280:720" -pix_fmt nv12 -f null /dev/null \
-vf "scale=720:480" -pix_fmt nv12 -f null /dev/null

test results:
            2 scale   5 scale   10 scale
Improved    25%       107%      148%

Signed-off-by: Wang, Shaofei 
Reviewed-by: Zhao, Jun 
---
 fftools/ffmpeg.c| 121 
 fftools/ffmpeg.h|  14 ++
 fftools/ffmpeg_filter.c |   1 +
 3 files changed, 128 insertions(+), 8 deletions(-)

diff --git a/fftools/ffmpeg.c b/fftools/ffmpeg.c
index 544f1a1..676c783 100644
--- a/fftools/ffmpeg.c
+++ b/fftools/ffmpeg.c
@@ -509,6 +509,15 @@ static void ffmpeg_cleanup(int ret)
                 }
                 av_fifo_freep(&fg->inputs[j]->ist->sub2video.sub_queue);
             }
+#if HAVE_THREADS
+            fg->inputs[j]->waited_frm = NULL;
+            av_frame_free(&fg->inputs[j]->input_frm);
+            pthread_mutex_lock(&fg->inputs[j]->process_mutex);
+            fg->inputs[j]->t_end = 1;
+            pthread_cond_signal(&fg->inputs[j]->process_cond);
+            pthread_mutex_unlock(&fg->inputs[j]->process_mutex);
+            pthread_join(fg->inputs[j]->abr_thread, NULL);
+#endif
             av_buffer_unref(&fg->inputs[j]->hw_frames_ctx);
             av_freep(&fg->inputs[j]->name);
             av_freep(&fg->inputs[j]);
@@ -1419,12 +1428,13 @@ static void finish_output_stream(OutputStream *ost)
  *
  * @return  0 for success, <0 for severe errors
  */
-static int reap_filters(int flush)
+static int reap_filters(int flush, InputFilter * ifilter)
 {
     AVFrame *filtered_frame = NULL;
     int i;
 
-    /* Reap all buffers present in the buffer sinks */
+    /* Reap all buffers present in the buffer sinks, or only those of the
+     * filter graph that has ifilter as one of its inputs */
     for (i = 0; i < nb_output_streams; i++) {
         OutputStream *ost = output_streams[i];
         OutputFile    *of = output_files[ost->file_index];
@@ -1436,6 +1446,11 @@ static int reap_filters(int flush)
             continue;
         filter = ost->filter->filter;
 
+        if (ifilter) {
+            if (ifilter != output_streams[i]->filter->graph->inputs[0])
+                continue;
+        }
+
         if (!ost->initialized) {
             char error[1024] = "";
             ret = init_output_stream(ost, error, sizeof(error));
@@ -2179,7 +2194,8 @@ static int ifilter_send_frame(InputFilter *ifilter, 
AVFrame *frame)
 }
 }
 
-ret = reap_filters(1);
+ret = HAVE_THREADS ? reap_filters(1, ifilter) : reap_filters(1, NULL);
+
 if (ret < 0 && ret != AVERROR_EOF) {
 av_log(NULL, AV_LOG_ERROR, "Error while filtering: %s\n", 
av_err2str(ret));
 return ret;
@@ -2252,12 +2268,100 @@ static int decode(AVCodecContext *avctx, AVFrame 
*frame, int *got_frame, AVPacke
 return 0;
 }
 
+#if HAVE_THREADS
+static void *filter_pipeline(void *arg)
+{
+    InputFilter *fl = arg;
+    AVFrame *frm;
+    int ret;
+    while(1) {
+        pthread_mutex_lock(&fl->process_mutex);
+        while (fl->waited_frm == NULL && !fl->t_end)
+            pthread_cond_wait(&fl->process_cond, &fl->process_mutex);
+        pthread_mutex_unlock(&fl->process_mutex);
+
+        if (fl->t_end) break;
+
+        frm = fl->waited_frm;
+        ret = ifilter_send_frame(fl, frm);
+ 

[FFmpeg-devel] [PATCH v4] Improved the performance of 1 decode + N filter graphs and adaptive bitrate.

2019-02-11 Thread Shaofei Wang
It enables multiple filter graph concurrency, which brings about a
4%~20% improvement in some 1:N scenarios with CPU or GPU acceleration

Below are some test cases and comparison as reference.
(Hardware platform: Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz)
(Software: Intel iHD driver - 16.9.00100, CentOS 7)

For 1:N transcode by GPU acceleration with vaapi:
./ffmpeg -vaapi_device /dev/dri/renderD128 -hwaccel vaapi \
-hwaccel_output_format vaapi \
-i ~/Videos/1920x1080p_30.00_x264_qp28.h264 \
-vf "scale_vaapi=1280:720" -c:v h264_vaapi -f null /dev/null \
-vf "scale_vaapi=720:480" -c:v h264_vaapi -f null /dev/null

test results:
            2 encoders   5 encoders   10 encoders
Improved    6.1%         6.9%         5.5%

For 1:N transcode by GPU acceleration with QSV:
./ffmpeg -hwaccel qsv -c:v h264_qsv \
-i ~/Videos/1920x1080p_30.00_x264_qp28.h264 \
-vf "scale_qsv=1280:720:format=nv12" -c:v h264_qsv -f null /dev/null \
-vf "scale_qsv=720:480:format=nv12" -c:v h264_qsv -f null /dev/null

test results:
2 encoders  5 encoders 10 encoders
Improved   6%   4% 15%

For Intel GPU acceleration case, 1 decode to N scaling, by QSV:
./ffmpeg -hwaccel qsv -c:v h264_qsv \
-i ~/Videos/1920x1080p_30.00_x264_qp28.h264 \
-vf "scale_qsv=1280:720:format=nv12,hwdownload" -pix_fmt nv12 -f null 
/dev/null \
-vf "scale_qsv=720:480:format=nv12,hwdownload" -pix_fmt nv12 -f null 
/dev/null

test results:
            2 scale   5 scale   10 scale
Improved    12%       21%       21%

For CPU only 1 decode to N scaling:
./ffmpeg -i ~/Videos/1920x1080p_30.00_x264_qp28.h264 \
-vf "scale=1280:720" -pix_fmt nv12 -f null /dev/null \
-vf "scale=720:480" -pix_fmt nv12 -f null /dev/null

test results:
            2 scale   5 scale   10 scale
Improved    25%       107%      148%

Signed-off-by: Wang, Shaofei 
Reviewed-by: Zhao, Jun 
---
 fftools/ffmpeg.c| 112 +---
 fftools/ffmpeg.h|  14 ++
 fftools/ffmpeg_filter.c |   4 ++
 3 files changed, 124 insertions(+), 6 deletions(-)

diff --git a/fftools/ffmpeg.c b/fftools/ffmpeg.c
index 544f1a1..67b1a2a 100644
--- a/fftools/ffmpeg.c
+++ b/fftools/ffmpeg.c
@@ -1419,13 +1419,18 @@ static void finish_output_stream(OutputStream *ost)
  *
  * @return  0 for success, <0 for severe errors
  */
-static int reap_filters(int flush)
+static int reap_filters(int flush, InputFilter * ifilter)
 {
 AVFrame *filtered_frame = NULL;
 int i;
 
-    /* Reap all buffers present in the buffer sinks */
+    /* Reap all buffers present in the buffer sinks, or only the buffers of
+     * the specified input filter's graph */
     for (i = 0; i < nb_output_streams; i++) {
+        if (ifilter) {
+            if (ifilter != output_streams[i]->filter->graph->inputs[0])
+                continue;
+        }
 OutputStream *ost = output_streams[i];
 OutputFile*of = output_files[ost->file_index];
 AVFilterContext *filter;
@@ -2179,7 +2184,8 @@ static int ifilter_send_frame(InputFilter *ifilter, 
AVFrame *frame)
 }
 }
 
-ret = reap_filters(1);
+ret = HAVE_THREADS ? reap_filters(1, ifilter) : reap_filters(1, NULL);
+
 if (ret < 0 && ret != AVERROR_EOF) {
 av_log(NULL, AV_LOG_ERROR, "Error while filtering: %s\n", 
av_err2str(ret));
 return ret;
@@ -2208,6 +2214,14 @@ static int ifilter_send_eof(InputFilter *ifilter, 
int64_t pts)
 
 ifilter->eof = 1;
 
+#if HAVE_THREADS
+    ifilter->waited_frm = NULL;
+    pthread_mutex_lock(&ifilter->process_mutex);
+    ifilter->t_end = 1;
+    pthread_cond_signal(&ifilter->process_cond);
+    pthread_mutex_unlock(&ifilter->process_mutex);
+    pthread_join(ifilter->f_thread, NULL);
+#endif
 if (ifilter->filter) {
 ret = av_buffersrc_close(ifilter->filter, pts, AV_BUFFERSRC_FLAG_PUSH);
 if (ret < 0)
@@ -2252,12 +2266,95 @@ static int decode(AVCodecContext *avctx, AVFrame 
*frame, int *got_frame, AVPacke
 return 0;
 }
 
+#if HAVE_THREADS
+static void *filter_pipeline(void *arg)
+{
+    InputFilter *fl = arg;
+    AVFrame *frm;
+    int ret;
+    while(1) {
+        pthread_mutex_lock(&fl->process_mutex);
+        while (fl->waited_frm == NULL && !fl->t_end)
+            pthread_cond_wait(&fl->process_cond, &fl->process_mutex);
+        pthread_mutex_unlock(&fl->process_mutex);
+
+        if (fl->t_end) break;
+
+        frm = fl->waited_frm;
+        ret = ifilter_send_frame(fl, frm);
+        if (ret < 0) {
+            av_log(NULL, AV_LOG_ERROR,
+                   "Failed to inject frame into filter network: %s\n", av_err2str(ret));
+        } else {
+            ret = reap_filters(0, fl);
+        }
+        fl->t_error = ret;
+
+        pthread_mutex_lock(&fl->finish_mutex);
+        fl->waited_frm = NULL;
+        pthread_cond_signal(&fl->finish_cond);
+        pthread_mutex_unlock(&fl->finish_mutex);
+
+        if (ret < 0)
+            break;
+    }
+

[FFmpeg-devel] [PATCH v3] Improved the performance of 1 decode + N filter graphs and adaptive bitrate.

2019-01-16 Thread Shaofei Wang
With the new option "-abr_pipeline",
it enables multiple filter graph concurrency, which brings about a
4%~20% improvement in some 1:N scenarios with CPU or GPU acceleration

Below are some test cases and comparison as reference.
(Hardware platform: Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz)
(Software: Intel iHD driver - 16.9.00100, CentOS 7)

For 1:N transcode by GPU acceleration with vaapi:
./ffmpeg -vaapi_device /dev/dri/renderD128 -hwaccel vaapi \
-hwaccel_output_format vaapi \
-i ~/Videos/1920x1080p_30.00_x264_qp28.h264 \
-vf "scale_vaapi=1280:720" -c:v h264_vaapi -f null /dev/null \
-vf "scale_vaapi=720:480" -c:v h264_vaapi -f null /dev/null \
-abr_pipeline

test results:
            2 encoders   5 encoders   10 encoders
Improved    6.1%         6.9%         5.5%

For 1:N transcode by GPU acceleration with QSV:
./ffmpeg -hwaccel qsv -c:v h264_qsv \
-i ~/Videos/1920x1080p_30.00_x264_qp28.h264 \
-vf "scale_qsv=1280:720:format=nv12" -c:v h264_qsv -f null /dev/null \
-vf "scale_qsv=720:480:format=nv12" -c:v h264_qsv -f null /dev/null

test results:
2 encoders  5 encoders 10 encoders
Improved   6%   4% 15%

For Intel GPU acceleration case, 1 decode to N scaling, by QSV:
./ffmpeg -hwaccel qsv -c:v h264_qsv \
-i ~/Videos/1920x1080p_30.00_x264_qp28.h264 \
-vf "scale_qsv=1280:720:format=nv12,hwdownload" -pix_fmt nv12 -f null 
/dev/null \
-vf "scale_qsv=720:480:format=nv12,hwdownload" -pix_fmt nv12 -f null 
/dev/null

test results:
            2 scale   5 scale   10 scale
Improved    12%       21%       21%

For CPU only 1 decode to N scaling:
./ffmpeg -i ~/Videos/1920x1080p_30.00_x264_qp28.h264 \
-vf "scale=1280:720" -pix_fmt nv12 -f null /dev/null \
-vf "scale=720:480" -pix_fmt nv12 -f null /dev/null \
-abr_pipeline

test results:
            2 scale   5 scale   10 scale
Improved    25%       107%      148%

Signed-off-by: Wang, Shaofei 
Reviewed-by: Zhao, Jun 
---
 fftools/ffmpeg.c| 228 
 fftools/ffmpeg.h|  15 
 fftools/ffmpeg_filter.c |   4 +
 fftools/ffmpeg_opt.c|   6 +-
 4 files changed, 237 insertions(+), 16 deletions(-)

diff --git a/fftools/ffmpeg.c b/fftools/ffmpeg.c
index 544f1a1..7dbff15 100644
--- a/fftools/ffmpeg.c
+++ b/fftools/ffmpeg.c
@@ -1523,6 +1523,109 @@ static int reap_filters(int flush)
 return 0;
 }
 
+static int pipeline_reap_filters(int flush, InputFilter * ifilter)
+{
+AVFrame *filtered_frame = NULL;
+int i;
+
+for (i = 0; i < nb_output_streams; i++) {
+if (ifilter == output_streams[i]->filter->graph->inputs[0]) break;
+}
+OutputStream *ost = output_streams[i];
+OutputFile*of = output_files[ost->file_index];
+AVFilterContext *filter;
+AVCodecContext *enc = ost->enc_ctx;
+int ret = 0;
+
+if (!ost->filter || !ost->filter->graph->graph)
+return 0;
+filter = ost->filter->filter;
+
+if (!ost->initialized) {
+char error[1024] = "";
+ret = init_output_stream(ost, error, sizeof(error));
+if (ret < 0) {
+av_log(NULL, AV_LOG_ERROR, "Error initializing output stream %d:%d 
-- %s\n",
+   ost->file_index, ost->index, error);
+exit_program(1);
+}
+}
+
+if (!ost->filtered_frame && !(ost->filtered_frame = av_frame_alloc()))
+return AVERROR(ENOMEM);
+filtered_frame = ost->filtered_frame;
+
+while (1) {
+double float_pts = AV_NOPTS_VALUE; // this is identical to 
filtered_frame.pts but with higher precision
+ret = av_buffersink_get_frame_flags(filter, filtered_frame,
+   AV_BUFFERSINK_FLAG_NO_REQUEST);
+if (ret < 0) {
+if (ret != AVERROR(EAGAIN) && ret != AVERROR_EOF) {
+av_log(NULL, AV_LOG_WARNING,
+   "Error in av_buffersink_get_frame_flags(): %s\n", 
av_err2str(ret));
+} else if (flush && ret == AVERROR_EOF) {
+if (av_buffersink_get_type(filter) == AVMEDIA_TYPE_VIDEO)
+do_video_out(of, ost, NULL, AV_NOPTS_VALUE);
+}
+break;
+}
+if (ost->finished) {
+av_frame_unref(filtered_frame);
+continue;
+}
+if (filtered_frame->pts != AV_NOPTS_VALUE) {
+int64_t start_time = (of->start_time == AV_NOPTS_VALUE) ? 0 : 
of->start_time;
+AVRational filter_tb = av_buffersink_get_time_base(filter);
+AVRational tb = enc->time_base;
+int extra_bits = av_clip(29 - av_log2(tb.den), 0, 16);
+
+tb.den <<= extra_bits;
+float_pts =
+av_rescale_q(filtered_frame->pts, filter_tb, tb) -
+av_rescale_q(start_time, AV_TIME_BASE_Q, tb);
+float_pts /= 1 << extra_bits;
+// avoid exact midoints to reduce the chance of 

[FFmpeg-devel] [PATCH v2] Improved the performance of 1 decode + N filter graphs and adaptive bitrate.

2019-01-15 Thread Shaofei Wang
With the new option "-abr_pipeline",
it enables multiple filter graph concurrency, which brings about a
4%~20% improvement in some 1:N scenarios with CPU or GPU acceleration

Below are some test cases and comparison as reference.
(Hardware platform: Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz)
(Software: Intel iHD driver - 16.9.00100, CentOS 7)

For 1:N transcode by GPU acceleration with vaapi:
./ffmpeg -vaapi_device /dev/dri/renderD128 -hwaccel vaapi \
-hwaccel_output_format vaapi \
-i ~/Videos/1920x1080p_30.00_x264_qp28.h264 \
-vf "scale_vaapi=1280:720" -c:v h264_vaapi -f null /dev/null \
-vf "scale_vaapi=720:480" -c:v h264_vaapi -f null /dev/null \
-abr_pipeline

test results:
            2 encoders   5 encoders   10 encoders
Improved    6.1%         6.9%         5.5%

For 1:N transcode by GPU acceleration with QSV:
./ffmpeg -hwaccel qsv -c:v h264_qsv \
-i ~/Videos/1920x1080p_30.00_x264_qp28.h264 \
-vf "scale_qsv=1280:720:format=nv12" -c:v h264_qsv -f null /dev/null \
-vf "scale_qsv=720:480:format=nv12" -c:v h264_qsv -f null /dev/null

test results:
2 encoders  5 encoders 10 encoders
Improved   6%   4% 15%

For Intel GPU acceleration case, 1 decode to N scaling, by QSV:
./ffmpeg -hwaccel qsv -c:v h264_qsv \
-i ~/Videos/1920x1080p_30.00_x264_qp28.h264 \
-vf "scale_qsv=1280:720:format=nv12,hwdownload" -pix_fmt nv12 -f null 
/dev/null \
-vf "scale_qsv=720:480:format=nv12,hwdownload" -pix_fmt nv12 -f null 
/dev/null

test results:
            2 scale   5 scale   10 scale
Improved    12%       21%       21%

For CPU only 1 decode to N scaling:
./ffmpeg -i ~/Videos/1920x1080p_30.00_x264_qp28.h264 \
-vf "scale=1280:720" -pix_fmt nv12 -f null /dev/null \
-vf "scale=720:480" -pix_fmt nv12 -f null /dev/null \
-abr_pipeline

test results:
            2 scale   5 scale   10 scale
Improved    25%       107%      148%

Signed-off-by: Wang, Shaofei 
Reviewed-by: Zhao, Jun 
---
 fftools/ffmpeg.c| 238 +---
 fftools/ffmpeg.h|  15 +++
 fftools/ffmpeg_filter.c |   6 ++
 fftools/ffmpeg_opt.c|   6 +-
 4 files changed, 251 insertions(+), 14 deletions(-)

diff --git a/fftools/ffmpeg.c b/fftools/ffmpeg.c
index 544f1a1..d608194 100644
--- a/fftools/ffmpeg.c
+++ b/fftools/ffmpeg.c
@@ -1523,6 +1523,110 @@ static int reap_filters(int flush)
 return 0;
 }
 
+static int pipeline_reap_filters(int flush, InputFilter * ifilter)
+{
+AVFrame *filtered_frame = NULL;
+int i;
+
+for (i = 0; i < nb_output_streams; i++) {
+if (ifilter == output_streams[i]->filter->graph->inputs[0]) break;
+}
+OutputStream *ost = output_streams[i];
+OutputFile*of = output_files[ost->file_index];
+AVFilterContext *filter;
+AVCodecContext *enc = ost->enc_ctx;
+int ret = 0;
+
+if (!ost->filter || !ost->filter->graph->graph)
+return 0;
+filter = ost->filter->filter;
+
+if (!ost->initialized) {
+char error[1024] = "";
+ret = init_output_stream(ost, error, sizeof(error));
+if (ret < 0) {
+av_log(NULL, AV_LOG_ERROR, "Error initializing output stream %d:%d 
-- %s\n",
+   ost->file_index, ost->index, error);
+exit_program(1);
+}
+}
+
+if (!ost->filtered_frame && !(ost->filtered_frame = av_frame_alloc())) {
+return AVERROR(ENOMEM);
+}
+filtered_frame = ost->filtered_frame;
+
+while (1) {
+double float_pts = AV_NOPTS_VALUE; // this is identical to 
filtered_frame.pts but with higher precision
+ret = av_buffersink_get_frame_flags(filter, filtered_frame,
+   AV_BUFFERSINK_FLAG_NO_REQUEST);
+if (ret < 0) {
+if (ret != AVERROR(EAGAIN) && ret != AVERROR_EOF) {
+av_log(NULL, AV_LOG_WARNING,
+   "Error in av_buffersink_get_frame_flags(): %s\n", 
av_err2str(ret));
+} else if (flush && ret == AVERROR_EOF) {
+if (av_buffersink_get_type(filter) == AVMEDIA_TYPE_VIDEO)
+do_video_out(of, ost, NULL, AV_NOPTS_VALUE);
+}
+break;
+}
+if (ost->finished) {
+av_frame_unref(filtered_frame);
+continue;
+}
+if (filtered_frame->pts != AV_NOPTS_VALUE) {
+int64_t start_time = (of->start_time == AV_NOPTS_VALUE) ? 0 : 
of->start_time;
+AVRational filter_tb = av_buffersink_get_time_base(filter);
+AVRational tb = enc->time_base;
+int extra_bits = av_clip(29 - av_log2(tb.den), 0, 16);
+
+tb.den <<= extra_bits;
+float_pts =
+av_rescale_q(filtered_frame->pts, filter_tb, tb) -
+av_rescale_q(start_time, AV_TIME_BASE_Q, tb);
+float_pts /= 1 << extra_bits;
+// avoid exact midoints to reduce the chance 

[FFmpeg-devel] [PATCH] Improved the performance of 1 decode + N filter graphs and adaptive bitrate.

2019-01-08 Thread Shaofei Wang
With the new option "-abr_pipeline",
it enables multiple filter graph concurrency, which brings about a
4%~20% improvement in some 1:N scenarios with CPU or GPU acceleration

Below are some test cases and comparison as reference.
(Hardware platform: Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz)
(Software: Intel iHD driver - 16.9.00100, CentOS 7)

For 1:N transcode by GPU acceleration with vaapi:
./ffmpeg -vaapi_device /dev/dri/renderD128 -hwaccel vaapi \
-hwaccel_output_format vaapi \
-i ~/Videos/1920x1080p_30.00_x264_qp28.h264 \
-vf "scale_vaapi=1280:720" -c:v h264_vaapi -f null /dev/null \
-vf "scale_vaapi=720:480" -c:v h264_vaapi -f null /dev/null \
-abr_pipeline

test results:
            2 encoders   5 encoders   10 encoders
Improved    6.1%         6.9%         5.5%

For 1:N transcode by GPU acceleration with QSV:
./ffmpeg -hwaccel qsv -c:v h264_qsv \
-i ~/Videos/1920x1080p_30.00_x264_qp28.h264 \
-vf "scale_qsv=1280:720:format=nv12" -c:v h264_qsv -f null /dev/null \
-vf "scale_qsv=720:480:format=nv12" -c:v h264_qsv -f null /dev/null

test results:
2 encoders  5 encoders 10 encoders
Improved   6%   4% 15%

For Intel GPU acceleration case, 1 decode to N scaling, by QSV:
./ffmpeg -hwaccel qsv -c:v h264_qsv \
-i ~/Videos/1920x1080p_30.00_x264_qp28.h264 \
-vf "scale_qsv=1280:720:format=nv12,hwdownload" -pix_fmt nv12 -f null 
/dev/null \
-vf "scale_qsv=720:480:format=nv12,hwdownload" -pix_fmt nv12 -f null 
/dev/null

test results:
            2 scale   5 scale   10 scale
Improved    12%       21%       21%

For CPU only 1 decode to N scaling:
./ffmpeg -i ~/Videos/1920x1080p_30.00_x264_qp28.h264 \
-vf "scale=1280:720" -pix_fmt nv12 -f null /dev/null \
-vf "scale=720:480" -pix_fmt nv12 -f null /dev/null \
-abr_pipeline

test results:
            2 scale   5 scale   10 scale
Improved    25%       107%      148%

Signed-off-by: Wang, Shaofei 
Reviewed-by: Zhao, Jun 
---
 fftools/ffmpeg.c| 239 +---
 fftools/ffmpeg.h|  14 +++
 fftools/ffmpeg_filter.c |   6 ++
 fftools/ffmpeg_opt.c|   6 +-
 4 files changed, 251 insertions(+), 14 deletions(-)

diff --git a/fftools/ffmpeg.c b/fftools/ffmpeg.c
index 544f1a1..f7a41fe 100644
--- a/fftools/ffmpeg.c
+++ b/fftools/ffmpeg.c
@@ -1523,6 +1523,112 @@ static int reap_filters(int flush)
 return 0;
 }
 
+static int pipeline_reap_filters(int flush, InputFilter * ifilter)
+{
+AVFrame *filtered_frame = NULL;
+int i;
+
+for (i = 0; i < nb_output_streams; i++) {
+if (ifilter == output_streams[i]->filter->graph->inputs[0]) break;
+}
+OutputStream *ost = output_streams[i];
+OutputFile*of = output_files[ost->file_index];
+AVFilterContext *filter;
+AVCodecContext *enc = ost->enc_ctx;
+int ret = 0;
+
+if (!ost->filter || !ost->filter->graph->graph)
+return 0;
+filter = ost->filter->filter;
+
+if (!ost->initialized) {
+char error[1024] = "";
+ret = init_output_stream(ost, error, sizeof(error));
+if (ret < 0) {
+av_log(NULL, AV_LOG_ERROR, "Error initializing output stream %d:%d 
-- %s\n",
+   ost->file_index, ost->index, error);
+exit_program(1);
+}
+}
+
+if (!ost->filtered_frame && !(ost->filtered_frame = av_frame_alloc())) {
+return AVERROR(ENOMEM);
+}
+filtered_frame = ost->filtered_frame;
+
+while (1) {
+double float_pts = AV_NOPTS_VALUE; // this is identical to 
filtered_frame.pts but with higher precision
+ret = av_buffersink_get_frame_flags(filter, filtered_frame,
+   AV_BUFFERSINK_FLAG_NO_REQUEST);
+if (ret < 0) {
+if (ret != AVERROR(EAGAIN) && ret != AVERROR_EOF) {
+av_log(NULL, AV_LOG_WARNING,
+   "Error in av_buffersink_get_frame_flags(): %s\n", 
av_err2str(ret));
+} else if (flush && ret == AVERROR_EOF) {
+if (av_buffersink_get_type(filter) == AVMEDIA_TYPE_VIDEO)
+do_video_out(of, ost, NULL, AV_NOPTS_VALUE);
+}
+break;
+}
+if (ost->finished) {
+av_frame_unref(filtered_frame);
+continue;
+}
+if (filtered_frame->pts != AV_NOPTS_VALUE) {
+int64_t start_time = (of->start_time == AV_NOPTS_VALUE) ? 0 : 
of->start_time;
+AVRational filter_tb = av_buffersink_get_time_base(filter);
+AVRational tb = enc->time_base;
+int extra_bits = av_clip(29 - av_log2(tb.den), 0, 16);
+
+tb.den <<= extra_bits;
+float_pts =
+av_rescale_q(filtered_frame->pts, filter_tb, tb) -
+av_rescale_q(start_time, AV_TIME_BASE_Q, tb);
+float_pts /= 1 << extra_bits;
+// avoid exact midoints to reduce the chance 

[FFmpeg-devel] [PATCH] Improved the performance of 1 decode + N filter graphs and adaptive bitrate.

2018-12-31 Thread Shaofei Wang
With the new option "-abr_pipeline",
it enables multiple filter graph concurrency, which brings an obvious
improvement in some 1:N scenarios with CPU and GPU acceleration

Below are some test cases and comparison as reference.
(Hardware platform: Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz)
(Software: Intel iHD driver - 16.9.00100, CentOS 7)

For Intel GPU acceleration case, 1 decode to N scaling, by vaapi:
./ffmpeg -vaapi_device /dev/dri/renderD128 -hwaccel vaapi \
-hwaccel_output_format vaapi \
-i ~/Videos/1920x1080p_30.00_x264_qp28.h264 \
-vf "scale_vaapi=1280:720:format=nv12,hwdownload" \
-pix_fmt nv12 -f null /dev/null \
-vf "scale_vaapi=720:480:format=nv12,hwdownload" \
-pix_fmt nv12 -f null /dev/null \
-abr_pipeline

test results:
            2 scale   5 scale   10 scale
Improved    34%       184%      240%

For Intel GPU acceleration case, 1 decode to N scaling, by QSV:
./ffmpeg -hwaccel qsv -c:v h264_qsv \
-i ~/Videos/1920x1080p_30.00_x264_qp28.h264 \
-vf "scale_qsv=1280:720:format=nv12,hwdownload" -pix_fmt nv12 -f null 
/dev/null \
-vf "scale_qsv=720:480:format=nv12,hwdownload" -pix_fmt nv12 -f null 
/dev/null

test results:
            2 scale   5 scale   10 scale
Improved    12%       21%       21%

For CPU only 1 decode to N scaling:
./ffmpeg -i ~/Videos/1920x1080p_30.00_x264_qp28.h264 \
-vf "scale=1280:720" -pix_fmt nv12 -f null /dev/null \
-vf "scale=720:480" -pix_fmt nv12 -f null /dev/null \
-abr_pipeline

test results:
            2 scale   5 scale   10 scale
Improved    25%       107%      148%

For 1:N transcode by GPU acceleration with vaapi:
./ffmpeg -vaapi_device /dev/dri/renderD128 -hwaccel vaapi \
-hwaccel_output_format vaapi \
-i ~/Videos/1920x1080p_30.00_x264_qp28.h264 \
-vf "scale_vaapi=1280:720" -c:v h264_vaapi -f null /dev/null \
-vf "scale_vaapi=720:480" -c:v h264_vaapi -f null /dev/null \
-abr_pipeline

test results:
            2 encoders   5 encoders   10 encoders
Improved    6.1%         6.9%         5.5%

For 1:N transcode by GPU acceleration with QSV:
./ffmpeg -hwaccel qsv -c:v h264_qsv \
-i ~/Videos/1920x1080p_30.00_x264_qp28.h264 \
-vf "scale_qsv=1280:720:format=nv12" -c:v h264_qsv -f null /dev/null \
-vf "scale_qsv=720:480:format=nv12" -c:v h264_qsv -f null /dev/null

test results:
2 encoders  5 encoders 10 encoders
Improved   6%   4% 15%

Signed-off-by: Wang, Shaofei 
Reviewed-by: Zhao, Jun 
---
 fftools/ffmpeg.c| 239 +---
 fftools/ffmpeg.h|  12 +++
 fftools/ffmpeg_filter.c |   6 ++
 fftools/ffmpeg_opt.c|   6 +-
 4 files changed, 249 insertions(+), 14 deletions(-)

diff --git a/fftools/ffmpeg.c b/fftools/ffmpeg.c
index 544f1a1..6131782 100644
--- a/fftools/ffmpeg.c
+++ b/fftools/ffmpeg.c
@@ -1523,6 +1523,112 @@ static int reap_filters(int flush)
 return 0;
 }
 
+static int pipeline_reap_filters(int flush, InputFilter * ifilter)
+{
+AVFrame *filtered_frame = NULL;
+int i;
+
+for (i = 0; i < nb_output_streams; i++) {
+if (ifilter == output_streams[i]->filter->graph->inputs[0]) break;
+}
+OutputStream *ost = output_streams[i];
+OutputFile*of = output_files[ost->file_index];
+AVFilterContext *filter;
+AVCodecContext *enc = ost->enc_ctx;
+int ret = 0;
+
+if (!ost->filter || !ost->filter->graph->graph)
+return 0;
+filter = ost->filter->filter;
+
+if (!ost->initialized) {
+char error[1024] = "";
+ret = init_output_stream(ost, error, sizeof(error));
+if (ret < 0) {
+av_log(NULL, AV_LOG_ERROR, "Error initializing output stream %d:%d 
-- %s\n",
+   ost->file_index, ost->index, error);
+exit_program(1);
+}
+}
+
+if (!ost->filtered_frame && !(ost->filtered_frame = av_frame_alloc())) {
+return AVERROR(ENOMEM);
+}
+filtered_frame = ost->filtered_frame;
+
+while (1) {
+double float_pts = AV_NOPTS_VALUE; // this is identical to 
filtered_frame.pts but with higher precision
+ret = av_buffersink_get_frame_flags(filter, filtered_frame,
+   AV_BUFFERSINK_FLAG_NO_REQUEST);
+if (ret < 0) {
+if (ret != AVERROR(EAGAIN) && ret != AVERROR_EOF) {
+av_log(NULL, AV_LOG_WARNING,
+   "Error in av_buffersink_get_frame_flags(): %s\n", 
av_err2str(ret));
+} else if (flush && ret == AVERROR_EOF) {
+if (av_buffersink_get_type(filter) == AVMEDIA_TYPE_VIDEO)
+do_video_out(of, ost, NULL, AV_NOPTS_VALUE);
+}
+break;
+}
+if (ost->finished) {
+av_frame_unref(filtered_frame);
+continue;
+}
+if (filtered_frame->pts != AV_NOPTS_VALUE) {
+int64_t start_time = (of->start_time ==