Re: [x265] Custom LowRes scale
I can do that :)
Do you have a standard way to generate these figures (video, options)? Or shall I just generate a couple of figures to put in the commit?

On 07/21/2014 06:16 PM, Deepthi Nandakumar wrote:
> Thanks, this is certainly an enhancement to x265 lookahead. We would be interested in this - especially if you can also include some efficiency (bitrate vs SSIM) metrics that describe the penalty moving from X265_LOWRES_SCALE of 4 to higher scales.
>
> On Mon, Jul 21, 2014 at 8:49 PM, Nicolas Morey-Chaisemartin nmo...@kalray.eu wrote:
>> Hi,
>>
>> We recently profiled x265 pre-analysis to estimate what performance we could reach using our accelerator, and I was quite disappointed by the results. Running on a Core i7 with AVX at roughly 2.7GHz, we barely reached the 30fps mark using the ultrafast preset on a 4K video.
>>
>> After a little bit of browsing I realized that work in LowRes is always done at 1/4th of the final resolution, which seems fair but requires a huge amount of work for 4K.
>>
>> It seemed straightforward enough to change the divider at LowRes initialization, but there are a lot of hard-coded values that depend on both the LowRes divider and the LowRes CU size.
>>
>> Here's a patch (definitely not applicable like this, but just to give an idea of where I'm going) that seems to fix most of the hard-coded values. It still works with an X265_LOWRES_SCALE of 4 and the perf is definitely improving (29fps => 40fps on a 2048x1024 medium preset on an E5504).
>>
>> Would you be interested in a clean version of this? At least the hard-coded CU_SIZE part?
>>
>> IMHO it would be better to have a dynamic value for LowRes depending on the preset (or equivalent) and the input resolution... 1/4th is fast enough in HD not to be an issue, but for an RT stream in 4K or more, 1/16 will be compulsory.
Re: [x265] Custom LowRes scale
On 07/21/2014 07:11 PM, Steve Borho wrote:
> Interesting. I imagine much 4k content would work decently well even with further downscaling of the lookahead pictures.
>
> The lowres motion vectors are used in weight analysis as well, so that file would need to be updated.

I'll have a look at it. It doesn't seem as straightforward as the other files though.

While we're talking about lowres MVs: from what I could gather, they are not used during the motionSearch on the full-res picture. As a lot of time is spent finding them, wouldn't it be useful to add them as candidates in the full-res search?

Nicolas
___
x265-devel mailing list
x265-devel@videolan.org
https://mailman.videolan.org/listinfo/x265-devel
Re: [x265] Custom LowRes scale
On 07/22, Nicolas Morey-Chaisemartin wrote:
> On 07/21/2014 07:11 PM, Steve Borho wrote:
>> Interesting. I imagine much 4k content would work decently well even with further downscaling of the lookahead pictures.
>>
>> The lowres motion vectors are used in weight analysis as well, so that file would need to be updated.
>
> I'll have a look at it. It doesn't seem as straightforward as the other files though.

It is slightly more complicated; you'll want to scale up the block sizes used for motion-compensated weight analysis - up to 32x32 or 64x64 based on how much further you downscale the lowres in lookahead.

> While we're talking about lowres MVs: from what I could gather, they are not used during the motionSearch on the full-res picture. As a lot of time is spent finding them, wouldn't it be useful to add them as candidates in the full-res search?

This has been on my TODO list for ages; a couple of people have claimed they've tried it and it hasn't helped as much as you might think. But I haven't had a working patch in hand to verify it.

The AMVP fixup after motion search, where we get to go shopping for a better MVP after the search, often makes extra motion candidates superfluous.

--
Steve Borho
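
The block-size scaling described above can be sketched in a few lines of C++. This is only an illustration under stated assumptions: `weightAnalysisBlockSize` is a hypothetical helper, not an x265 function, and it assumes that one lowres CU of `lowresCuSize` pixels covers `lowresCuSize * scale` full-res pixels, capped at a 64x64 luma block.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical helper (not part of x265): full-res block edge covered by one
// lowres CU, capped at the largest luma block the primitives are assumed to
// support (64 here).
inline int weightAnalysisBlockSize(int lowresCuSize, int scale, int maxBlock = 64)
{
    assert(lowresCuSize > 0 && scale > 0 && maxBlock > 0);
    return std::min(lowresCuSize * scale, maxBlock);
}
```

With the 8-pixel lowres CU from the patch, a scale of 2 gives 16x16 blocks, 4 gives 32x32 and 8 gives 64x64. Anything beyond that would hit the cap, so larger areas would have to be handled by looping over 64x64 blocks.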
Re: [x265] Custom LowRes scale
On 07/22/2014 10:08 AM, Steve Borho wrote:
> On 07/22, Nicolas Morey-Chaisemartin wrote:
>> I'll have a look at it. It doesn't seem as straightforward as the other files though.
>
> It is slightly more complicated; you'll want to scale up the block sizes used for motion-compensated weight analysis - up to 32x32 or 64x64 based on how much further you downscale the lowres in lookahead.

Is there a clean way to get a LUMA_NNxNN value from a block size?
Should I handle blocks larger than 64x64 by looping on the 64x64 blocks, or simply add a check at lowres init that the fullres CU size is <= 64?

>> While we're talking about lowres MVs: from what I could gather, they are not used during the motionSearch on the full-res picture. As a lot of time is spent finding them, wouldn't it be useful to add them as candidates in the full-res search?
>
> This has been on my TODO list for ages; a couple of people have claimed they've tried it and it hasn't helped as much as you might think. But I haven't had a working patch in hand to verify it.
>
> The AMVP fixup after motion search, where we get to go shopping for a better MVP after the search, often makes extra motion candidates superfluous.

I started working on this yesterday for our accelerator but I got carried away on lowres scaling. I don't have any results yet, but I'll post them as soon as I have some.

By the way, lowres MVs are in lowres luma pixels, right? So I'll need to scale the vector by 2 to get the full-res MV?

Nicolas
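
The vector scaling asked about above could look like the following C++ sketch. It assumes that the lowres and full-res vectors are stored with the same sub-pel precision, so that only the spatial scale differs; the thread does not confirm this. `MV` and `scaleLowresMv` are hypothetical stand-ins, not the encoder's own types.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the encoder's motion vector type.
struct MV
{
    int32_t x;
    int32_t y;
};

// Map a lowres motion vector to full-res units. Assumes both vectors use the
// same sub-pel precision, so each component is simply multiplied by the
// lowres scale (2 for the current half-width, half-height lowres picture).
inline MV scaleLowresMv(MV lowres, int scale)
{
    assert(scale > 0);
    return MV{ lowres.x * scale, lowres.y * scale };
}
```

Passing the scale as a parameter rather than hard-coding 2 keeps the candidate derivation valid if X265_LOWRES_SCALE is changed later.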
[x265] Custom LowRes scale
Hi,

We recently profiled x265 pre-analysis to estimate what performance we could reach using our accelerator, and I was quite disappointed by the results. Running on a Core i7 with AVX at roughly 2.7GHz, we barely reached the 30fps mark using the ultrafast preset on a 4K video.

After a little bit of browsing I realized that work in LowRes is always done at 1/4th of the final resolution, which seems fair but requires a huge amount of work for 4K.

It seemed straightforward enough to change the divider at LowRes initialization, but there are a lot of hard-coded values that depend on both the LowRes divider and the LowRes CU size.

Here's a patch (definitely not applicable like this, but just to give an idea of where I'm going) that seems to fix most of the hard-coded values. It still works with an X265_LOWRES_SCALE of 4 and the perf is definitely improving (29fps => 40fps on a 2048x1024 medium preset on an E5504).

Would you be interested in a clean version of this? At least the hard-coded CU_SIZE part?

IMHO it would be better to have a dynamic value for LowRes depending on the preset (or equivalent) and the input resolution... 1/4th is fast enough in HD not to be an issue, but for an RT stream in 4K or more, 1/16 will be compulsory.

Nicolas

---
 x265/source/common/common.h          |  1 +
 x265/source/common/lowres.cpp        |  4 ++--
 x265/source/encoder/frameencoder.cpp |  7 ---
 x265/source/encoder/ratecontrol.cpp  | 16
 x265/source/encoder/slicetype.cpp    |  8
 5 files changed, 19 insertions(+), 17 deletions(-)

diff --git a/x265/source/common/common.h b/x265/source/common/common.h
index 06f60e7..00e73fc 100644
--- a/x265/source/common/common.h
+++ b/x265/source/common/common.h
@@ -156,6 +156,7 @@ typedef int32_t coeff_t; // transform coefficient
 // high cost estimates (intra and inter both suffer)
 #define X265_LOWRES_CU_SIZE 8
 #define X265_LOWRES_CU_BITS 3
+#define X265_LOWRES_SCALE 2

 #define X265_MALLOC(type, count) (type*)x265_malloc(sizeof(type) * (count))
 #define X265_FREE(ptr) x265_free(ptr)
diff --git a/x265/source/common/lowres.cpp b/x265/source/common/lowres.cpp
index 5fc2f6b..6138023 100644
--- a/x265/source/common/lowres.cpp
+++ b/x265/source/common/lowres.cpp
@@ -31,8 +31,8 @@ bool Lowres::create(TComPicYuv *orig, int _bframes, bool bAQEnabled)
 {
     isLowres = true;
     bframes = _bframes;
-    width = orig->getWidth() / 2;
-    lines = orig->getHeight() / 2;
+    width = orig->getWidth() / X265_LOWRES_SCALE;
+    lines = orig->getHeight() / X265_LOWRES_SCALE;
     lumaStride = width + 2 * orig->getLumaMarginX();
     if (lumaStride & 31)
         lumaStride += 32 - (lumaStride & 31);
diff --git a/x265/source/encoder/frameencoder.cpp b/x265/source/encoder/frameencoder.cpp
index 8c3ee26..7213f60 100644
--- a/x265/source/encoder/frameencoder.cpp
+++ b/x265/source/encoder/frameencoder.cpp
@@ -1300,9 +1300,10 @@ int FrameEncoder::calcQpForCu(uint32_t cuAddr, double baseQp)
     /* Derive qpOffet for each CU by averaging offsets for all 16x16 blocks in the cu.
      */
     double qp_offset = 0;
-    int maxBlockCols = (m_frame->getPicYuvOrg()->getWidth() + (16 - 1)) / 16;
-    int maxBlockRows = (m_frame->getPicYuvOrg()->getHeight() + (16 - 1)) / 16;
-    int noOfBlocks = g_maxCUSize / 16;
+    int lowResCu = (X265_LOWRES_CU_SIZE * X265_LOWRES_SCALE);
+    int maxBlockCols = (m_frame->getPicYuvOrg()->getWidth() + (lowResCu - 1)) / lowResCu;
+    int maxBlockRows = (m_frame->getPicYuvOrg()->getHeight() + (lowResCu - 1)) / lowResCu;
+    int noOfBlocks = g_maxCUSize / lowResCu;
     int block_y = (cuAddr / m_frame->getPicSym()->getFrameWidthInCU()) * noOfBlocks;
     int block_x = (cuAddr * noOfBlocks) - block_y * m_frame->getPicSym()->getFrameWidthInCU();
diff --git a/x265/source/encoder/ratecontrol.cpp b/x265/source/encoder/ratecontrol.cpp
index 4358994..5fcc27a 100644
--- a/x265/source/encoder/ratecontrol.cpp
+++ b/x265/source/encoder/ratecontrol.cpp
@@ -161,8 +161,8 @@ void RateControl::calcAdaptiveQuantFrame(Frame *pic)
     if (m_param->rc.aqMode == X265_AQ_NONE || m_param->rc.aqStrength == 0)
     {
         /* Need to init it anyways for CU tree */
-        int cuWidth = ((maxCol / 2) + X265_LOWRES_CU_SIZE - 1) >> X265_LOWRES_CU_BITS;
-        int cuHeight = ((maxRow / 2) + X265_LOWRES_CU_SIZE - 1) >> X265_LOWRES_CU_BITS;
+        int cuWidth = ((maxCol / X265_LOWRES_SCALE) + X265_LOWRES_CU_SIZE - 1) >> X265_LOWRES_CU_BITS;
+        int cuHeight = ((maxRow / X265_LOWRES_SCALE) + X265_LOWRES_CU_SIZE - 1) >> X265_LOWRES_CU_BITS;
         int cuCount = cuWidth * cuHeight;

         if (m_param->rc.aqMode && m_param->rc.aqStrength == 0)
@@ -194,9 +194,9 @@ void RateControl::calcAdaptiveQuantFrame(Frame *pic)
         if (m_param->rc.aqMode == X265_AQ_AUTO_VARIANCE)
         {
             double bit_depth_correction = pow(1 << (X265_DEPTH - 8), 0.5);
-            for (block_y = 0; block_y < maxRow; block_y += 16)
+            for
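
The hard-coded values replaced in the patch above all follow from two numbers, the lowres scale and the lowres CU size. A minimal C++ sketch of that dependency is given here for illustration. `LowresGeometry` and `lowresGeometry` are hypothetical names, not x265 code; the arithmetic mirrors the `/ X265_LOWRES_SCALE` and `>> X265_LOWRES_CU_BITS` expressions in the patch.

```cpp
#include <cassert>

// Hypothetical helper type (not part of x265) collecting the derived sizes.
struct LowresGeometry
{
    int width;        // lowres luma width in pixels
    int height;       // lowres luma height in pixels
    int cuCols;       // lowres CU columns, rounded up
    int cuRows;       // lowres CU rows, rounded up
    int fullResBlock; // full-res pixels covered by one lowres CU edge
};

// Derive the lookahead dimensions from the full-res size, the scale and the
// lowres CU size expressed as a bit count (3 bits = 8 pixels, as in the patch).
inline LowresGeometry lowresGeometry(int fullWidth, int fullHeight, int scale, int cuBits = 3)
{
    assert(fullWidth > 0 && fullHeight > 0 && scale > 0 && cuBits >= 0);
    const int cuSize = 1 << cuBits;

    LowresGeometry g;
    g.width = fullWidth / scale;
    g.height = fullHeight / scale;
    g.cuCols = (g.width + cuSize - 1) >> cuBits;
    g.cuRows = (g.height + cuSize - 1) >> cuBits;
    g.fullResBlock = cuSize * scale;
    return g;
}
```

For a 3840x2160 input, a scale of 2 yields a 1920x1080 lowres picture and 16x16 full-res blocks, which is where the literal 16 in the original code comes from. A scale of 4 yields 960x540 and 32x32 blocks.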
Re: [x265] Custom LowRes scale
Thanks, this is certainly an enhancement to x265 lookahead. We would be interested in this - especially if you can also include some efficiency (bitrate vs SSIM) metrics that describe the penalty moving from X265_LOWRES_SCALE of 4 to higher scales.

On Mon, Jul 21, 2014 at 8:49 PM, Nicolas Morey-Chaisemartin nmo...@kalray.eu wrote:
> Hi,
>
> We recently profiled x265 pre-analysis to estimate what performance we could reach using our accelerator, and I was quite disappointed by the results. Running on a Core i7 with AVX at roughly 2.7GHz, we barely reached the 30fps mark using the ultrafast preset on a 4K video.
>
> After a little bit of browsing I realized that work in LowRes is always done at 1/4th of the final resolution, which seems fair but requires a huge amount of work for 4K.
>
> It seemed straightforward enough to change the divider at LowRes initialization, but there are a lot of hard-coded values that depend on both the LowRes divider and the LowRes CU size.
>
> Here's a patch (definitely not applicable like this, but just to give an idea of where I'm going) that seems to fix most of the hard-coded values. It still works with an X265_LOWRES_SCALE of 4 and the perf is definitely improving (29fps => 40fps on a 2048x1024 medium preset on an E5504).
>
> Would you be interested in a clean version of this? At least the hard-coded CU_SIZE part?
>
> IMHO it would be better to have a dynamic value for LowRes depending on the preset (or equivalent) and the input resolution... 1/4th is fast enough in HD not to be an issue, but for an RT stream in 4K or more, 1/16 will be compulsory.
Re: [x265] Custom LowRes scale
On 07/21, Nicolas Morey-Chaisemartin wrote:
> Hi,
>
> We recently profiled x265 pre-analysis to estimate what performance we could reach using our accelerator, and I was quite disappointed by the results. Running on a Core i7 with AVX at roughly 2.7GHz, we barely reached the 30fps mark using the ultrafast preset on a 4K video.
>
> After a little bit of browsing I realized that work in LowRes is always done at 1/4th of the final resolution, which seems fair but requires a huge amount of work for 4K.
>
> It seemed straightforward enough to change the divider at LowRes initialization, but there are a lot of hard-coded values that depend on both the LowRes divider and the LowRes CU size.
>
> Here's a patch (definitely not applicable like this, but just to give an idea of where I'm going) that seems to fix most of the hard-coded values. It still works with an X265_LOWRES_SCALE of 4 and the perf is definitely improving (29fps => 40fps on a 2048x1024 medium preset on an E5504).
>
> Would you be interested in a clean version of this? At least the hard-coded CU_SIZE part?
>
> IMHO it would be better to have a dynamic value for LowRes depending on the preset (or equivalent) and the input resolution... 1/4th is fast enough in HD not to be an issue, but for an RT stream in 4K or more, 1/16 will be compulsory.

Interesting. I imagine much 4k content would work decently well even with further downscaling of the lookahead pictures.

The lowres motion vectors are used in weight analysis as well, so that file would need to be updated.

Another thing that makes our lookahead slower than x264's is our intra analysis. We're measuring DC and planar and then running our 'all-angs' function, which generates all 33 angular predictions at once, and then measuring them all with satd.

I've been wanting to turn that into a scan that measures 6 angular modes evenly spaced by 5. You then pick the best angular option and search +2 and -2, then pick the best again and search +1 and -1 (a gradient descent). At the end you measure DC and planar and pick the best cost mode. This results in 12 predictions and satd calls instead of 35 (closer to x264's 9 lowres predictions).

Our 'all-angs' function is pretty good; it takes approx 10x the time of one angular prediction function to generate all 33. So it's not obvious whether it would be better for this scan approach to call the ten single angular functions or the 'all-angs' function once.

About half of the angular predictions must be transposed. The all-angs function ignores this for better performance, and thus we transpose the original pixels instead when measuring the satd cost of those modes (but we only have to do that transpose once). The individual angular functions do this transpose for you internally if the mode requires it, resulting in possibly more transposes. This further complicates guessing which approach would be faster.

Once this approach is working we could use it in the main encode functions as a --fast-intra option.

Lastly, we need a signed CLA from you before we can accept code contributions:

https://bitbucket.org/multicoreware/x265/wiki/Contribute

--
Steve Borho
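
The coarse-to-fine intra scan proposed above can be sketched in C++ as follows. This is a rough illustration, not x265 code: `IntraChoice` and `scanIntraModes` are hypothetical names, the cost callback stands in for "predict one mode and measure it with satd", and the coarse starting modes 5, 10, ..., 30 are an assumption, since the message only says six modes spaced by 5. Mode numbering follows HEVC (0 = planar, 1 = DC, 2..34 = angular).

```cpp
#include <cassert>
#include <functional>
#include <limits>

// Result of the scan: chosen mode, its cost, and how many modes were measured.
struct IntraChoice
{
    int mode;
    int cost;
    int evaluations;
};

// cost(mode) is assumed to return the satd cost of predicting with that mode.
inline IntraChoice scanIntraModes(const std::function<int(int)>& cost)
{
    int bestMode = -1;
    int bestCost = std::numeric_limits<int>::max();
    int evaluations = 0;

    auto tryMode = [&](int mode) {
        ++evaluations;
        const int c = cost(mode);
        if (c < bestCost)
        {
            bestCost = c;
            bestMode = mode;
        }
    };

    // Coarse pass: six angular modes spaced by 5 (assumed start at mode 5).
    for (int mode = 5; mode <= 30; mode += 5)
        tryMode(mode);
    assert(bestMode >= 5);

    // Refinement: probe +/-2 around the best mode, then +/-1 around the new
    // best. The probes stay within 2..33, so no range check is needed.
    for (int step = 2; step >= 1; --step)
    {
        const int center = bestMode;
        tryMode(center - step);
        tryMode(center + step);
    }

    // Finally compare against planar (0) and DC (1).
    tryMode(0);
    tryMode(1);

    return IntraChoice{ bestMode, bestCost, evaluations };
}
```

The scan always measures 6 + 2 + 2 + 2 = 12 modes, matching the count in the message. As with any local search, it can settle on a mode that is not the global minimum when the cost is not smooth across neighbouring angles.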