At 2015-03-11 06:01:08, dave <[email protected]> wrote:
>On 03/10/2015 12:12 PM, Steve Borho wrote:
>> On 03/10, dave wrote:
>>>>> This produces some interesting numbers.
>>> sorry, I mixed these two up.
>>>>>>> incorrect:Without using registers for constants
>>>>>>> with using registers
>>>>>>> x265 [info]: I32: Intra 100%(DC 0% P 40% Ang 58%)
>>>>>>>
>>>>>>> encoded 2000 frames in 95.98s (20.84 fps), 1020.04 kb/s
>>>>>>>
>>>>>>> incorrect:With using registers for constants
>>>>>>> without using registers
>>>>>>> x265 [info]: I32: Intra 99%(DC 39% P 16% Ang 43%)
>>>>>>>
>>>>>>> encoded 2000 frames in 93.10s (21.48 fps), 1008.63 kb/s
>>>>>>>
>>>>>>> I just added --cu-stats to the same command options that I used
>>>>>>> previously and I ran it several times and got exactly the same
>>>>>>> percentages. Times varied by less than a second for each build. So
>>>>>>> how can simple register usage in one primitive affect intra pred
>>>>>>> decisions?
>>>>>> it shouldn't, the behavior must be wrong in one of the cases. no change
>>>>>> in performance should be able to impact the encoder output (or any
>>>>>> coding decisions)
>>>>>>
>>>>> So execution time isn't directly measured for decision making?
>>>>>
>>>>> The output is also different.
>>>>>
>>>>> ls -l bridge-close*
>>>>> -rw-r--r-- 1 shakezula shakezula 8432204 Mar 10 09:25 bridge-close1.y4m
>>>>> -rw-r--r-- 1 shakezula shakezula 8527219 Mar 10 07:49 bridge-close.y4m
>>>>>
>>>>> bridge-close1.y4m was generated without the use of registers to hold
>>>>> constants.
>>>> yeah, definitely a bug in one of the two versions and if the testbench
>>>> doesn't catch it that's really bad.
>>> I am using the same source tree for both so the only differences is
>>> the register usage.
>>>
>>> The unpatched tip, which is going to use c code for planar32,
>>> produces the same intra pred decision percentages as not using
>>> registers for constants but different encoded output.
>>>
>>> x265 [info]: I32: Intra 99%(DC 39% P 16% Ang 43%)
>>>
>>> encoded 2000 frames in 101.82s (19.64 fps), 1008.64 kb/s
>>>
>>> ls -l bridge-close.*
>>> -rw-r--r-- 1 shakezula shakezula 8432239 Mar 10 10:03 bridge-close.hevc
>>>
>>> The reconstructed output of all three looks the same.
>>>
>>> Just to test for overflow I modified the testbench to test with all
>>> maximum 10-bit values of 0x3FF instead of random values and it
>>> passes. One more bit, 0x4FF, and it fails. Though the y4m file has
>>> 8 bit depth.
>> this sounds like your outputs would be non-deterministic if you just ran
>> the same encode multiple times? That would be a different class of bug,
>> perhaps unrelated to your work on the intra primitives.
>>
>> I don't think we often check for non-determinism on older architectures.
>> we regularly test --no-asm against fully optimized outputs but this
>> only tests primitives normally used on our test machines.
>According to agner(The microarchitecture of Intel and AMD CPUs, p169)
>there is a non-deterministic aspect of my processor but it should only
>affect execution time, not output. Most of the primitives that I have
>worked on have variable results from the testbench when run repeatedly.
>A few even seem to randomly alternate between two distinct execution
>times, something that you might expect from agner's findings.
>
>The qp and bits generated is consistent across encodes. The qp is
>mostly consistent across builds only varying by .01 if at all but the
>bits varies more so across builds.
>
>After all this, what is preferred? Constants copied to registers or
>used from memory? The benchtest says memory runs faster(I believe they
>are cached in a temp register,see agner). Encodes are less conclusive
>since each build doesn't use planar32 equally.
>

Are you saving and restoring the extra XMM register that holds the
constants? The compiler stores floats in it.
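
On the overflow test quoted above (passes with 0x3FF, fails with 0x4FF),
the arithmetic can be checked with a small C sketch. It assumes the
standard HEVC planar formula and assumes, without confirmation from the
patch in this thread, that the assembly accumulates the weighted sum in
unsigned 16-bit lanes. The names planar_sum and fits_u16 are made up for
illustration and are not x265 functions.

```c
#include <stdint.h>

/* Weighted sum of HEVC planar prediction, before the final shift, for an
 * n x n block in which every reference sample has the same value v:
 *   (n-1-x)*left + (x+1)*topRight + (n-1-y)*above + (y+1)*bottomLeft + n
 * With all samples equal, the four weights always add up to 2*n. */
uint32_t planar_sum(uint32_t n, uint32_t x, uint32_t y, uint32_t v)
{
    return (n - 1 - x) * v + (x + 1) * v
         + (n - 1 - y) * v + (y + 1) * v
         + n;
}

/* Returns 1 if the sum fits in an unsigned 16-bit lane, else 0.
 * ASSUMPTION: the 16-bit lane width is a guess about the assembly. */
int fits_u16(uint32_t sum)
{
    return sum <= UINT16_MAX;
}
```

For planar32, planar_sum(32, x, y, 0x3FF) is 64 * 1023 + 32 = 65504 for
every x and y, which still fits in 16 bits unsigned, while
planar_sum(32, x, y, 0x4FF) is 64 * 1279 + 32 = 81888, which does not.
If that assumption holds, the 0x4FF failure is expected and says nothing
about 8-bit or 10-bit input.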
_______________________________________________
x265-devel mailing list
[email protected]
https://mailman.videolan.org/listinfo/x265-devel
