Currently the driver converts planar YUV data to packed when using textured video. (And I fought so hard to convince the overlay scaler to accept planar data correctly ages ago :-).) Anyway, when trying to view HD clips on my rs690, I noticed that Xorg consumes quite a few CPU cycles, and a quick oprofile run revealed RADEONCopyMungedData as the top CPU hog, as expected.

So I figured that what was a good idea for the overlay scaler should be a good idea for textured video too: just copy the planar data and change the shader accordingly. In contrast to the overlay scaler there are some drawbacks here, though. Mostly, the GPU has to work harder because it needs to sample 3 textures instead of 1, and the shader is more complex. I don't think that should be much of a problem, since full 1080p25 requires a fillrate of "only" 50MT/s * 3, and even the slowest r300-based IGP should have around 600MT/s IIRC (rs690 has 1600MT/s). Of course, it would also make it easy to change the coefficients used by the yuv->rgb conversion (I think some sources are actually meant to use a different spec here).

The attached patch does exactly that (ok, it was halfway copied from the intel driver). A couple of comments:

- r300 only for now. r500 is obviously doable, and r200 should be possible too (I hope even the 2-pipe IGP chips might be fast enough, with the added benefit that rv250 would get textured video too, as this doesn't rely on the hw yuv->rgb conversion, which is broken on that asic). Dunno about r100: it can't really run the necessary shader for such a conversion. However, it has a PLANAR_YUV_ENABLE bit; I don't know how it works or how the chip would need to be configured, but if it works it should be very efficient. Also, this is currently mutually exclusive with the bicubic filter (combining the two could certainly be done).

- The xv attr for using the new code is meant for performance debugging more than anything else... Speaking of that, naturally RADEONCopyMungedData disappeared from the oprofile data, replaced by more libc usage (for memcpy, which was expected) and initially way higher delay_tsc usage (which wasn't quite expected), with performance numbers (using mplayer's benchmark mode) actually slightly lower in some cases... However, some tests revealed that this seemed to be due to texture access latency or something along these lines. Using textures with both macro and micro tiling improved things, though I couldn't quite make that work for now due to blitter misconfiguration. I ended up with the manual texture cache configuration, which is now indeed always faster (makes me wonder what performance gains you could get if you'd do that for the 3d driver and, more interestingly, HOW you'd do that sensibly there).

- Overall I saw an increase of up to 10% using mplayer's benchmark mode (this was with ffmpeg-mt and an X2 4850e). Considering it's just barely faster than realtime, every bit helps...
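
To make the fillrate estimate and the planar conversion concrete, here is a minimal C sketch. It is not taken from the patch: all function and struct names are made up for illustration, the frame layout is assumed to be I420/YV12-style (full-resolution Y plane, half-resolution U and V planes), and the coefficients are the usual limited-range BT.601 and BT.709 values rounded to three decimals. The real code does the conversion per pixel in the fragment shader, not on the CPU.

```c
#include <stdint.h>

/* Hypothetical helper: texel fetches per second when every output pixel
 * samples `planes` textures (3 for separate Y, U and V planes). */
static long long texel_rate(int width, int height, int fps, int planes)
{
    return (long long)width * height * fps * planes;
}

/* Limited-range YCbCr -> RGB coefficients (illustrative struct). */
struct yuv_coeffs {
    float y_scale; /* applied to (Y - 16) */
    float v_to_r;  /* applied to (V - 128) for red */
    float u_to_g;  /* applied to (U - 128) for green */
    float v_to_g;  /* applied to (V - 128) for green */
    float u_to_b;  /* applied to (U - 128) for blue */
};

/* Standard-definition (BT.601) and high-definition (BT.709) matrices. */
static const struct yuv_coeffs bt601 = { 1.164f, 1.596f, -0.392f, -0.813f, 2.017f };
static const struct yuv_coeffs bt709 = { 1.164f, 1.793f, -0.213f, -0.533f, 2.112f };

/* Clamp to [0, 255] and round to nearest. */
static uint8_t clamp_u8(float v)
{
    if (v < 0.0f)
        return 0;
    if (v > 255.0f)
        return 255;
    return (uint8_t)(v + 0.5f);
}

/* Convert the pixel at (x, y) by reading three separate planes, which is
 * the CPU analogue of sampling three textures in the shader. */
static void planar_yuv_to_rgb(const uint8_t *y_plane, const uint8_t *u_plane,
                              const uint8_t *v_plane, int y_pitch, int uv_pitch,
                              int x, int y, const struct yuv_coeffs *c,
                              uint8_t rgb[3])
{
    float Y = (float)y_plane[y * y_pitch + x] - 16.0f;
    /* Chroma is subsampled 2x2, so halve both coordinates. */
    float U = (float)u_plane[(y / 2) * uv_pitch + x / 2] - 128.0f;
    float V = (float)v_plane[(y / 2) * uv_pitch + x / 2] - 128.0f;

    rgb[0] = clamp_u8(c->y_scale * Y + c->v_to_r * V);
    rgb[1] = clamp_u8(c->y_scale * Y + c->u_to_g * U + c->v_to_g * V);
    rgb[2] = clamp_u8(c->y_scale * Y + c->u_to_b * U);
}
```

For 1920x1080 at 25 fps, texel_rate() with 3 planes gives 155,520,000 fetches per second, i.e. the "50MT/s * 3" figure above, still well below the roughly 600MT/s quoted for the slowest r300-based IGP. Switching between bt601 and bt709 only changes the constants, which is the flexibility mentioned for the shader-based conversion.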
Comments?
r300_planar_texturedxv.diff
Description: application/mbox
