(I'm re-sending this message because the attachment was too large for mesa-dev, and because I want to add wayland-devel CC. The valgrind output can be found here: http://runeks.dk/files/callgrind.out.11362).
Seems like you are right Pekka. I just ran weston through valgrind, and got some interesting results. I ran it like so: valgrind --tool=callgrind --dump-instr=yes --trace-jump=yes weston Which allows me to get the time spent on a per-instruction level. Now, this is running inside a virtual machine, so it won't be the same as running it natively, but it agrees with the other benchmarks in the sense that it suggests it is the simple calculations inside screenshooter.c that take up most of the CPU time, not calls to outside functions (like it was before with the slow glReadPixels() path). The callgrind output can be found at the following URL: http://runeks.dk/files/callgrind.out.11362 Open it with KCachegrind, select the function weston_recorder_frame_notify(), and go to the "Machine Code" tab in the lower right corner to see the interesting stuff. According to callgrind, a total of 54.39% CPU time is used in the four lines 251, 252, 253 and 255 in screenshooter.c. That's the function component_delta(): dr = (next >> 16) - (prev >> 16); dg = (next >> 8) - (prev >> 8); db = (next >> 0) - (prev >> 0); return (dr << 16) | (dg << 8) | (db << 0); Additionally, the lines 358, 359, 361, 362, 363 take up 25.9% CPU time. That is the innermost for-loop inside weston_recorder_frame_notify(). It's the following lines with the call to component_delta() not included (line 360) since we've already included the CPU usage of that: for (k = 0; k < width; k++) { next = *s++; delta = component_delta(next, *d); *d++ = next; if (run == 0 || delta == prev) { run++; And then finally the call to output_run() on line 365 takes up 10.6% CPU time: p = output_run(p, prev, run); Inside output_run(), the lines that take up the most CPU time are: 232, 233, 234, 235, 239, 240. They take up 9.94% CPU time. So it's basically the whole while loop in output_run() except the line with the call to __builtin_clz() (line 238): while (run > 0) { if (run <= 0xe0) { *p++ = delta | ((run - 1) << 24); break; } i = 24 - __builtin_clz(run); *p++ = delta | ((i + 0xe0) << 24); run -= 1 << (7 + i); } All of that adds up 90.89% CPU time spent inside screenshooter.c. /Rune On Tue, Mar 26, 2013 at 8:58 AM, Pekka Paalanen <[email protected]> wrote: On Tue, 26 Mar 2013 03:30:58 +0100 Rune Kjær Svendsen <[email protected]> wrote: > Marek, do you have an idea on where the currency bottleneck is? > > I just did a profiling with sysprof, zooming in on the desktop in Weston > and moving the mouse wildly around, so that the buffer is completely > changed for every frame. I got around 5 fps, which isn't *that* much, but > still an order of magnitude better than without your patches. > > sysprof says there is 100% CPU usage, but unlike the previous 0.5-FPS > recording, it's not in a single function, but spread out over several > functions: > > 35% weston_recorder_frame_notify > 11% __memcpy_ssse3 > 4.5% clear_page_c > 4.3% output_run > > Although I'm not completely sure I'm reading the sysprof output right. > weston_recorder_frame_notify, for example, has 35% CPU usage, but none of > its child functions has any significant CPU usage. I presume the CPU usage > in that function is from calling glReadPixels, although that's not apparent > from sysprof: > > weston_recorder_frame_notify 39.15% > 39.15% > - - kernel - - 0.00% > 0.01% > ret_from_intr 0.00% > 0.01% > __irqentry_text_start 0.00% Well, if you look at weston_recorder_frame_notify function, it has a naive loop over each single pixel it processes. component_delta() may get inlined, and output_run() you saw in the profile. I think it's possible it actually is weston_recorder_frame_notify eating the CPU. Can you get more precise profiling, like hits per source line or instruction? Thanks, pq Med venlig hilsen, Rune Kjær Svendsen Østerbrogade 111, 3. - 302 2100 København Ø Tlf.: 2835 0726 On Tue, Mar 26, 2013 at 8:58 AM, Pekka Paalanen <[email protected]> wrote: > On Tue, 26 Mar 2013 03:30:58 +0100 > Rune Kjær Svendsen <[email protected]> wrote: > > > Marek, do you have an idea on where the currency bottleneck is? > > > > I just did a profiling with sysprof, zooming in on the desktop in Weston > > and moving the mouse wildly around, so that the buffer is completely > > changed for every frame. I got around 5 fps, which isn't *that* much, but > > still an order of magnitude better than without your patches. > > > > sysprof says there is 100% CPU usage, but unlike the previous 0.5-FPS > > recording, it's not in a single function, but spread out over several > > functions: > > > > 35% weston_recorder_frame_notify > > 11% __memcpy_ssse3 > > 4.5% clear_page_c > > 4.3% output_run > > > > Although I'm not completely sure I'm reading the sysprof output right. > > weston_recorder_frame_notify, for example, has 35% CPU usage, but none of > > its child functions has any significant CPU usage. I presume the CPU > usage > > in that function is from calling glReadPixels, although that's not > apparent > > from sysprof: > > > > weston_recorder_frame_notify 39.15% > > 39.15% > > - - kernel - - 0.00% > > 0.01% > > ret_from_intr 0.00% > > 0.01% > > __irqentry_text_start 0.00% > > Well, if you look at weston_recorder_frame_notify function, it has a > naive loop over each single pixel it processes. component_delta() may > get inlined, and output_run() you saw in the profile. > > I think it's possible it actually is weston_recorder_frame_notify > eating the CPU. Can you get more precise profiling, like hits per source > line or instruction? > > > Thanks, > pq >
_______________________________________________ wayland-devel mailing list [email protected] http://lists.freedesktop.org/mailman/listinfo/wayland-devel
