Re: [x265] Custom LowRes scale

2014-07-22 Thread Nicolas Morey-Chaisemartin

I can do that :)
Do you have a standard way to generate these figures (video, options)?
Or shall I just generate a couple of figures to put in the commit?

On 07/21/2014 06:16 PM, Deepthi Nandakumar wrote:

> Thanks, this is certainly an enhancement to x265 lookahead. We would be
> interested in this - especially if you can also include some efficiency
> (bitrate vs SSIM) metrics that describe the penalty moving from
> X265_LOWRES_SCALE of 4 to higher scales.
>
> On Mon, Jul 21, 2014 at 8:49 PM, Nicolas Morey-Chaisemartin
> nmo...@kalray.eu wrote:
>
> > Hi,
> >
> > We recently profiled x265 pre-analysis to estimate what performance we
> > could reach using our accelerator and I was quite disappointed by the
> > performance.
> > When running on a Core-i7 with AVX at roughly 2.7GHz, we barely reached
> > the 30fps mark using the ultrafast preset on a 4K video.
> >
> > After a little bit of browsing I realized that work in LowRes is always
> > done at 1/4th of the final resolution, which seems fair but requires a huge
> > amount of work for 4K.
> > It seemed straightforward enough to change the divider at LowRes
> > initialization, but there are a lot of hard-coded values that
> > depend on both the LowRes divider and the LowRes CU size.
> >
> > Here's a patch (definitely not applicable as is, just to give an
> > idea of where I'm going) that seems to fix most of the hard-coded values.
> > It still works with an X265_LOWRES_SCALE of 4 and the perf is definitely
> > improving (29fps => 40fps on a 2048x1024 medium preset on an E5504).
> >
> > Would you be interested in a clean version of this? At least the
> > hard-coded CU_SIZE part?
> > IMHO it would be better to have a dynamic value for LowRes depending on
> > the preset (or equivalent) and the input resolution...
> > 1/4th is fast enough in HD not to be an issue, but for RT streams in 4K or
> > more, 1/16 will be compulsory.
> >
> > Nicolas


Re: [x265] Custom LowRes scale

2014-07-22 Thread Nicolas Morey-Chaisemartin


On 07/21/2014 07:11 PM, Steve Borho wrote:

> Interesting. I imagine much 4k content would work decently well even
> with further downscaling of the lookahead pictures.
>
> The lowres motion vectors are used in weight analysis as well, so that
> file would need to be updated.

I'll have a look at it. It doesn't seem as straightforward as the other files 
though.
While we're talking about lowres MVs: from what I could gather they are not used 
during the motionSearch on the full-res picture.
As a lot of time is spent finding those, wouldn't it be useful to add them as 
candidates in the full-res search?

Nicolas

___
x265-devel mailing list
x265-devel@videolan.org
https://mailman.videolan.org/listinfo/x265-devel


Re: [x265] Custom LowRes scale

2014-07-22 Thread Steve Borho
On 07/22, Nicolas Morey-Chaisemartin wrote:
>
> On 07/21/2014 07:11 PM, Steve Borho wrote:
> > Interesting. I imagine much 4k content would work decently well even
> > with further downscaling of the lookahead pictures.
> >
> > The lowres motion vectors are used in weight analysis as well, so that
> > file would need to be updated.
>
> I'll have a look at it. It doesn't seem as straightforward as the
> other files though.

It is slightly more complicated; you'll want to scale up the block sizes
used for motion-compensated weight analysis - up to 32x32 or 64x64 based
on how much further you downscale the lowres in lookahead.
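
One way to read this, as a sketch only: if an 8x8 lowres CU stands for an (8 * scale) square at full resolution, the weight-analysis block size could be derived from the scale and capped at 64. The function names and the enum below are hypothetical and only illustrate the arithmetic; they are not x265 identifiers, and the cap of 64 is an assumption taken from the sizes mentioned above.

```cpp
#include <cassert>

// Hypothetical block-size identifiers for illustration only;
// x265's own partition enums may be organised differently.
enum class LumaBlock { B16x16, B32x32, B64x64 };

// Full-resolution block size covered by one lowres CU,
// capped at 64 (assumed maximum for this sketch).
int weightBlockSize(int lowresCuSize, int lowresScale)
{
    assert(lowresCuSize > 0 && lowresScale > 0);
    int size = lowresCuSize * lowresScale;
    return size > 64 ? 64 : size;
}

// Map a square block size to the hypothetical enum above.
LumaBlock lumaBlockFromSize(int size)
{
    assert(size == 16 || size == 32 || size == 64);
    switch (size)
    {
    case 16: return LumaBlock::B16x16;
    case 32: return LumaBlock::B32x32;
    default: return LumaBlock::B64x64;
    }
}
```

With an 8x8 lowres CU, a scale of 2 maps to 16x16, a scale of 4 to 32x32 and a scale of 8 to 64x64; anything larger would need either looping over 64x64 blocks or a check at init time.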

> While we're talking about lowres MVs: from what I could gather they are
> not used during the motionSearch on the full-res picture.  As a lot of
> time is spent finding those, wouldn't it be useful to add them as
> candidates in the full-res search?

This has been on my TODO list for ages; a couple of people have claimed
they've tried it and it hasn't helped as much as you might think.  But I
haven't had a working patch in hand to verify it.

The AMVP fixup after motion search, where we get to go shopping for a
better MVP after the search, often makes extra motion candidates
superfluous.

-- 
Steve Borho


Re: [x265] Custom LowRes scale

2014-07-22 Thread Nicolas Morey-Chaisemartin


On 07/22/2014 10:08 AM, Steve Borho wrote:

> On 07/22, Nicolas Morey-Chaisemartin wrote:
>
> > I'll have a look at it. It doesn't seem as straightforward as the
> > other files though.
>
> It is slightly more complicated; you'll want to scale up the block sizes
> used for motion-compensated weight analysis - up to 32x32 or 64x64 based
> on how much further you downscale the lowres in lookahead.


Is there a clean way to get a LUMA_NNxNN value from a block size?
Should I handle blocks larger than 64x64 by looping over the 64x64 blocks, or simply 
add a check at lowres init that the full-res CU size is <= 64?




> > While we're talking about lowres MVs: from what I could gather they are
> > not used during the motionSearch on the full-res picture.  As a lot of
> > time is spent finding those, wouldn't it be useful to add them as
> > candidates in the full-res search?
>
> This has been on my TODO list for ages; a couple of people have claimed
> they've tried it and it hasn't helped as much as you might think.  But I
> haven't had a working patch in hand to verify it.
>
> The AMVP fixup after motion search, where we get to go shopping for a
> better MVP after the search, often makes extra motion candidates
> superfluous.


I started working on this yesterday for our accelerator but I got carried away 
with lowres scaling.
I don't have any results yet but I'll post them as soon as I have some.
By the way, lowres MVs are in lowres luma pixels, right? So I'll need to scale 
the vector by 2 to get the full-res MV?
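
A minimal sketch of the scaling being asked about, assuming lowres and full-resolution vectors use the same sub-pel precision so that only the picture scale differs (the thread does not confirm this). The MotionVector type and the function name are hypothetical, not x265 identifiers.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical motion vector type for illustration; x265 has its own MV class.
struct MotionVector
{
    int16_t x;
    int16_t y;
};

// Scale a lowres vector up to full resolution. Assumes both domains use the
// same sub-pel precision and that the scaled result still fits in 16 bits.
MotionVector scaleLowresMv(MotionVector lowresMv, int lowresScale)
{
    assert(lowresScale > 0);
    MotionVector full;
    full.x = static_cast<int16_t>(lowresMv.x * lowresScale);
    full.y = static_cast<int16_t>(lowresMv.y * lowresScale);
    return full;
}
```

With the scale taken from a single constant rather than a literal 2, the same helper would keep working if the lowres divider were changed.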

Nicolas



[x265] Custom LowRes scale

2014-07-21 Thread Nicolas Morey-Chaisemartin

Hi,

We recently profiled x265 pre-analysis to estimate what performance we could 
reach using our accelerator, and I was quite disappointed by the results.
When running on a Core-i7 with AVX at roughly 2.7GHz, we barely reached the 
30fps mark using the ultrafast preset on a 4K video.

After a little bit of browsing I realized that work in LowRes is always done at 
1/4th of the final resolution, which seems fair but requires a huge amount of 
work for 4K.
It seemed straightforward enough to change the divider at LowRes 
initialization, but there are a lot of hard-coded values that depend 
on both the LowRes divider and the LowRes CU size.

Here's a patch (definitely not applicable as is, just to give an idea of 
where I'm going) that seems to fix most of the hard-coded values.
It still works with an X265_LOWRES_SCALE of 4 and the perf is definitely improving 
(29fps => 40fps on a 2048x1024 medium preset on an E5504).

Would you be interested in a clean version of this? At least the hard-coded 
CU_SIZE part?
IMHO it would be better to have a dynamic value for LowRes depending on the preset 
(or equivalent) and the input resolution...
1/4th is fast enough in HD not to be an issue, but for RT streams in 4K or more, 
1/16 will be compulsory.
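
The dependency on the divider and the lowres CU size mentioned above can be sketched roughly as follows. The struct and function names are hypothetical and not part of x265; only the constant names in the comments come from the patch, and the values used are the ones the patch shows.

```cpp
#include <cassert>

// Values mirror the constants in the patch.
constexpr int kLowresCuSize = 8; // X265_LOWRES_CU_SIZE
constexpr int kLowresCuBits = 3; // X265_LOWRES_CU_BITS (log2 of the CU size)

// Hypothetical helper type for illustration, not an x265 structure.
struct LowresGeometry
{
    int width;        // lowres picture width in pixels
    int height;       // lowres picture height in pixels
    int cuCols;       // lowres CUs per row, rounded up
    int cuRows;       // lowres CU rows, rounded up
    int fullResBlock; // full-resolution size covered by one lowres CU
};

// Derive every size from the single scale factor instead of literal 2s and 16s.
LowresGeometry computeLowresGeometry(int fullWidth, int fullHeight, int scale)
{
    assert(fullWidth > 0 && fullHeight > 0 && scale > 0);
    LowresGeometry g;
    g.width = fullWidth / scale;
    g.height = fullHeight / scale;
    g.cuCols = (g.width + kLowresCuSize - 1) >> kLowresCuBits;
    g.cuRows = (g.height + kLowresCuSize - 1) >> kLowresCuBits;
    g.fullResBlock = kLowresCuSize * scale;
    return g;
}
```

For a 3840x2160 input, a scale of 2 gives a 1920x1080 lowres picture with 240x135 CUs, each covering 16x16 full-res pixels; a scale of 4 gives 960x540 with 120x68 CUs covering 32x32 pixels, i.e. roughly a quarter as many blocks to analyse.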

Nicolas

---
 x265/source/common/common.h  |  1 +
 x265/source/common/lowres.cpp|  4 ++--
 x265/source/encoder/frameencoder.cpp |  7 ---
 x265/source/encoder/ratecontrol.cpp  | 16 
 x265/source/encoder/slicetype.cpp|  8 
 5 files changed, 19 insertions(+), 17 deletions(-)

diff --git a/x265/source/common/common.h b/x265/source/common/common.h
index 06f60e7..00e73fc 100644
--- a/x265/source/common/common.h
+++ b/x265/source/common/common.h
@@ -156,6 +156,7 @@ typedef int32_t  coeff_t;  // transform coefficient
 // high cost estimates (intra and inter both suffer)
 #define X265_LOWRES_CU_SIZE   8
 #define X265_LOWRES_CU_BITS   3
+#define X265_LOWRES_SCALE 2
 
 #define X265_MALLOC(type, count)(type*)x265_malloc(sizeof(type) * (count))
 #define X265_FREE(ptr)  x265_free(ptr)
diff --git a/x265/source/common/lowres.cpp b/x265/source/common/lowres.cpp
index 5fc2f6b..6138023 100644
--- a/x265/source/common/lowres.cpp
+++ b/x265/source/common/lowres.cpp
@@ -31,8 +31,8 @@ bool Lowres::create(TComPicYuv *orig, int _bframes, bool bAQEnabled)
 {
     isLowres = true;
     bframes = _bframes;
-    width = orig->getWidth() / 2;
-    lines = orig->getHeight() / 2;
+    width = orig->getWidth() / X265_LOWRES_SCALE;
+    lines = orig->getHeight() / X265_LOWRES_SCALE;
     lumaStride = width + 2 * orig->getLumaMarginX();
     if (lumaStride & 31)
         lumaStride += 32 - (lumaStride & 31);
diff --git a/x265/source/encoder/frameencoder.cpp b/x265/source/encoder/frameencoder.cpp
index 8c3ee26..7213f60 100644
--- a/x265/source/encoder/frameencoder.cpp
+++ b/x265/source/encoder/frameencoder.cpp
@@ -1300,9 +1300,10 @@ int FrameEncoder::calcQpForCu(uint32_t cuAddr, double baseQp)
 
     /* Derive qpOffet for each CU by averaging offsets for all 16x16 blocks in the cu. */
     double qp_offset = 0;
-    int maxBlockCols = (m_frame->getPicYuvOrg()->getWidth() + (16 - 1)) / 16;
-    int maxBlockRows = (m_frame->getPicYuvOrg()->getHeight() + (16 - 1)) / 16;
-    int noOfBlocks = g_maxCUSize / 16;
+    int lowResCu = (X265_LOWRES_CU_SIZE * X265_LOWRES_SCALE);
+    int maxBlockCols = (m_frame->getPicYuvOrg()->getWidth() + (lowResCu - 1)) / lowResCu;
+    int maxBlockRows = (m_frame->getPicYuvOrg()->getHeight() + (lowResCu - 1)) / lowResCu;
+    int noOfBlocks = g_maxCUSize / lowResCu;
     int block_y = (cuAddr / m_frame->getPicSym()->getFrameWidthInCU()) * noOfBlocks;
     int block_x = (cuAddr * noOfBlocks) - block_y * m_frame->getPicSym()->getFrameWidthInCU();
 
diff --git a/x265/source/encoder/ratecontrol.cpp b/x265/source/encoder/ratecontrol.cpp
index 4358994..5fcc27a 100644
--- a/x265/source/encoder/ratecontrol.cpp
+++ b/x265/source/encoder/ratecontrol.cpp
@@ -161,8 +161,8 @@ void RateControl::calcAdaptiveQuantFrame(Frame *pic)
     if (m_param->rc.aqMode == X265_AQ_NONE || m_param->rc.aqStrength == 0)
     {
         /* Need to init it anyways for CU tree */
-        int cuWidth = ((maxCol / 2) + X265_LOWRES_CU_SIZE - 1) >> X265_LOWRES_CU_BITS;
-        int cuHeight = ((maxRow / 2) + X265_LOWRES_CU_SIZE - 1) >> X265_LOWRES_CU_BITS;
+        int cuWidth = ((maxCol / X265_LOWRES_SCALE) + X265_LOWRES_CU_SIZE - 1) >> X265_LOWRES_CU_BITS;
+        int cuHeight = ((maxRow / X265_LOWRES_SCALE) + X265_LOWRES_CU_SIZE - 1) >> X265_LOWRES_CU_BITS;
         int cuCount = cuWidth * cuHeight;
 
         if (m_param->rc.aqMode && m_param->rc.aqStrength == 0)
@@ -194,9 +194,9 @@ void RateControl::calcAdaptiveQuantFrame(Frame *pic)
         if (m_param->rc.aqMode == X265_AQ_AUTO_VARIANCE)
         {
             double bit_depth_correction = pow(1 << (X265_DEPTH - 8), 0.5);
-            for (block_y = 0; block_y < maxRow; block_y += 16)
+            for 

Re: [x265] Custom LowRes scale

2014-07-21 Thread Deepthi Nandakumar
Thanks, this is certainly an enhancement to x265 lookahead. We would be
interested in this - especially if you can also include some efficiency
(bitrate vs SSIM) metrics that describe the penalty moving from
X265_LOWRES_SCALE of 4 to higher scales.


On Mon, Jul 21, 2014 at 8:49 PM, Nicolas Morey-Chaisemartin 
nmo...@kalray.eu wrote:

> Hi,
>
> We recently profiled x265 pre-analysis to estimate what performance we
> could reach using our accelerator and I was quite disappointed by the
> performance.
> When running on a Core-i7 with AVX at roughly 2.7GHz, we barely reached
> the 30fps mark using the ultrafast preset on a 4K video.
>
> After a little bit of browsing I realized that work in LowRes is always
> done at 1/4th of the final resolution, which seems fair but requires a huge
> amount of work for 4K.
> It seemed straightforward enough to change the divider at LowRes
> initialization, but there are a lot of hard-coded values that
> depend on both the LowRes divider and the LowRes CU size.
>
> Here's a patch (definitely not applicable as is, just to give an
> idea of where I'm going) that seems to fix most of the hard-coded values.
> It still works with an X265_LOWRES_SCALE of 4 and the perf is definitely
> improving (29fps => 40fps on a 2048x1024 medium preset on an E5504).
>
> Would you be interested in a clean version of this? At least the
> hard-coded CU_SIZE part?
> IMHO it would be better to have a dynamic value for LowRes depending on
> the preset (or equivalent) and the input resolution...
> 1/4th is fast enough in HD not to be an issue, but for RT streams in 4K or
> more, 1/16 will be compulsory.
>
> Nicolas


Re: [x265] Custom LowRes scale

2014-07-21 Thread Steve Borho
On 07/21, Nicolas Morey-Chaisemartin wrote:
> Hi,
>
> We recently profiled x265 pre-analysis to estimate what performance we
> could reach using our accelerator and I was quite disappointed by the
> performance.  When running on a Core-i7 with AVX at roughly 2.7GHz, we
> barely reached the 30fps mark using the ultrafast preset on a 4K video.
>
> After a little bit of browsing I realized that work in LowRes is
> always done at 1/4th of the final resolution, which seems fair but
> requires a huge amount of work for 4K.  It seemed straightforward
> enough to change the divider at LowRes initialization, but
> there are a lot of hard-coded values that depend on both the LowRes
> divider and the LowRes CU size.
>
> Here's a patch (definitely not applicable as is, just to give an
> idea of where I'm going) that seems to fix most of the hard-coded
> values.  It still works with an X265_LOWRES_SCALE of 4 and the perf is
> definitely improving (29fps => 40fps on a 2048x1024 medium preset on an
> E5504).
>
> Would you be interested in a clean version of this? At least the
> hard-coded CU_SIZE part?  IMHO it would be better to have a dynamic
> value for LowRes depending on the preset (or equivalent) and the input
> resolution...  1/4th is fast enough in HD not to be an issue, but for
> RT streams in 4K or more, 1/16 will be compulsory.

Interesting. I imagine much 4k content would work decently well even
with further downscaling of the lookahead pictures.

The lowres motion vectors are used in weight analysis as well, so that
file would need to be updated.

Another thing that makes our lookahead slower than x264's is our intra
analysis.  We're measuring DC and planar and then running our 'all-angs'
function, which generates all 33 angular predictions at once, and then
measuring them all with satd. I've been wanting to turn that into a scan
that measures 6 angular modes evenly spaced by 5.  You then pick the
best angular option and search +2 and -2, then pick the best again and
search +1 and -1 (a gradient descent). At the end you measure DC and
planar and pick the best cost mode.  This results in 12 predictions and
satd calls instead of 35 (closer to x264's 9 lowres predictions).
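
The scan described above can be sketched as follows. This is only an illustration under stated assumptions: the cost callback stands in for "predict this mode and measure satd", the starting mode of 5 for the coarse pass is a guess (the text only says six modes spaced by 5), and the type and function names are hypothetical rather than x265 code. Mode numbering follows HEVC (0 = planar, 1 = DC, 2..34 = angular).

```cpp
#include <cassert>
#include <cstdint>
#include <functional>

// HEVC intra mode numbering: 0 = planar, 1 = DC, 2..34 = angular.
constexpr int kPlanar = 0;
constexpr int kDc = 1;

// Hypothetical result type for this sketch.
struct IntraScanResult
{
    int bestMode;
    uint32_t bestCost;
    int predictions; // number of cost evaluations performed
};

// cost(mode) stands in for "generate the prediction and measure its satd".
IntraScanResult scanIntraModes(const std::function<uint32_t(int)>& cost)
{
    assert(cost);
    IntraScanResult r{ -1, UINT32_MAX, 0 };

    auto tryMode = [&](int mode) {
        uint32_t c = cost(mode);
        r.predictions++;
        if (r.bestMode < 0 || c < r.bestCost)
        {
            r.bestCost = c;
            r.bestMode = mode;
        }
    };

    // Coarse pass: six angular modes spaced by 5 (starting at 5 is an assumption).
    for (int mode = 5; mode <= 30; mode += 5)
        tryMode(mode);

    // Refinement: +/-2 around the best mode, then +/-1 around the new best.
    // Starting from 5..30 this stays within 2..33, so no clamping is needed,
    // but it also means mode 34 is never tested in this sketch.
    for (int step = 2; step >= 1; step--)
    {
        int center = r.bestMode;
        tryMode(center - step);
        tryMode(center + step);
    }

    // Finally compare against the two non-angular modes.
    tryMode(kDc);
    tryMode(kPlanar);
    return r;
}
```

The evaluation count is 6 + 2 + 2 + 2 = 12, matching the figure given in the text. Passing a lambda that wraps a real prediction-plus-satd call would let the same scan be timed against the 'all-angs' path.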

Our 'all-angs' function is pretty good; it takes approx 10x the time of
one angular prediction function to generate all 33. So it's not obvious
whether it would be better for this scan approach to call the ten
singular angular functions or the 'all-angs' function once. About half
of the angular predictions must be transposed - and the all-angs
function ignores this for better performance, and thus we transpose the
original pixels instead when measuring satd cost of those modes (but we
only have to do that transpose once). The individual angular functions
do this transpose for you internally if the mode requires it, resulting
in possibly more transposes. This further complicates guessing which
approach would be faster.

Once this approach was working we could use it in the main encode
functions as a --fast-intra option.

Lastly, we need a signed CLA from you before we can accept code
contributions: https://bitbucket.org/multicoreware/x265/wiki/Contribute

-- 
Steve Borho