Thank you for this detailed insight. My trust in your competence will
persist. ;)
It is certainly important to know that parallelization happens at several
"levels" (frame/slice/references) and depends on the video attributes
(e.g. the number of slices, which follows from the height).
I saw mainly 2 of 10 threads active only with the slowest presets.
Medium presets usually had 6 of 10 threads active, but still with lower
overall CPU utilization; it is quite possible that videos with small
dimensions are encoded less efficiently because of thread sync overhead.
On 24.11.2013 at 23:04, Steve Borho <[email protected]> wrote:
On Nov 23, 2013, at 1:06 PM, Mario Rohkrämer <[email protected]> wrote:
On 23.11.2013 at 19:45, Tom Vaughan
<[email protected]> wrote:
Mario,
The number of concurrently encoded frames is already reported in the
x265[info] output.
Example:
x265 [info]: WPP streams / pool / frames : 17 / 32 / 1
That makes me almost concerned...
With a Phenom-II X6 (6 cores) I get e.g.: 5 / 6 / *2*
So it encodes only 2 frames in parallel? Because there are other
intense tasks utilizing other threads?
According to ProcessExplorer, x265 runs 10 threads. Up to 6 of them are
more or less busy. Sometimes only 2 of them, depending on the preset
used. So it is probably correct, just possibly not yet "optimal".
The three numbers in that line describe all the parallelism variables.
The encoder is creating 6 worker threads, one for each CPU core. The
worker threads encode a row of CTUs at a time (glossing over a few
details). Your video is fairly small, only 5 rows of 64x64 blocks, so
there is not much parallelism there to be exposed to wave-front
analysis. The 2 frame threads, by design, are mostly idle; they have
some setup work at the beginning of each frame and some entropy encode
work at the end of each frame, but for the bulk of the encode time they
are blocked waiting for reference frames to complete rows or for their
own rows to be completed.
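
The first number in that info line appears to match the number of CTU
rows: a 1080-line video with 64x64 CTUs gives 17 rows, which agrees with
Tom's example above. A minimal sketch of that arithmetic, assuming the row
count is simply the frame height divided by the CTU size, rounded up. The
288-line height is a hypothetical value picked only because it yields 5
rows; the actual height of the video is not stated in this thread.

```python
import math

def ctu_rows(height, ctu_size=64):
    """Number of CTU rows available to wave-front parallelism.

    Assumption: rows = ceil(height / ctu_size); a partial row at the
    bottom of the frame still counts as a row.
    """
    return math.ceil(height / ctu_size)

print(ctu_rows(1080))  # 17, matches "WPP streams ... : 17" in the example
print(ctu_rows(288))   # 5, hypothetical height that gives 5 rows
```

With only 5 rows there are at most 5 rows to work on per frame, so 6
worker threads cannot all be busy on one frame at once.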
Adding more frame threads would not necessarily help much: since there
is a three-row lag between reference frames (deblock+sao+me-range), 5
rows do not give you much room for frame parallelism either.
At that resolution, you would be better served with 32x32 blocks (--ctu
32) if you need to keep more cores occupied. You would get more
wave-front parallelism and could probably bump to -F3 effectively. You
will want to decrease the me-range to 28 (ctu size minus luma
half-filter) to keep the me-range from limiting frame parallelism.
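
The two numbers in that suggestion can be reproduced with a short sketch.
It assumes the luma half-filter is 4 (the value implied by 32 - 28 = 28
above) and again uses a hypothetical 288-line frame height, since the real
height is not given in this thread; the function name is illustrative
only and is not part of x265.

```python
import math

# Assumption: luma half-filter length, inferred from "ctu size minus
# luma half-filter" giving 28 for a CTU size of 32.
LUMA_HALF_FILTER = 4

def rows_and_me_range(height, ctu_size):
    """Return (CTU rows, me-range that stays within one CTU row)."""
    rows = math.ceil(height / ctu_size)
    me_range = ctu_size - LUMA_HALF_FILTER
    return rows, me_range

# Hypothetical 288-line video encoded with 32x32 CTUs.
print(rows_and_me_range(288, 32))  # (9, 28)
```

Going from 5 to 9 rows in this hypothetical case is what leaves room for
more wave-front parallelism and a third frame thread.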
--
Steve Borho
--
__________
Fun and success!
Mario *LigH* Rohkrämer
mailto:[email protected]
_______________________________________________
x265-devel mailing list
[email protected]
https://mailman.videolan.org/listinfo/x265-devel