Hi everybody, I'm new in profiling and got stuck in analyzing cachegrind's output. Any help and hints for further profiling strategies are appreciated! It is not so important for me to get the problem at hand optimized, but to get some insights into profiling.
Background: I tried to implement a pool allocator for fixed size obects in C++. Since I know, that the objects allocated last will most likely get freed first, I attempted to implement some shrink-functionality: if some threshold of free slots are available at the end, shrink the pool. However, I failed miserably: My allocator is even slower than malloc/free. In my first attempt, I implemented some bitset as an array of ints to keep track of the occupied slots. I have two functions: find_first_unset_bit() (needed for allocation of blocks) and find_unset_tail() (needed for the shrink-feature). A gprof-run told me, that these two are hotspots and that the second one takes about 30% more time than the first one (for the most optimal allocation/deallocation pattern). I thought that it would be good to examine the reason for this as a starting point. A line-by-line gprof-analysis showed up to be pretty useless: disassembling my objects via `objdump -ld ...' told me that part of the instructions in my `-O3'-optimized binary are assigned to the "wrong" source lines. So I went on to try cachegrind out. Unfortunately, the results look pretty much the same for both functions (see below) and I'm stuck now. To give you some introduction on how the two functions are implemented (if you like, I could also paste the original C++ code, but I think this is quite pointless): `find_first_unset_bit()' and `find_unset_tail()' make use of some member-variables: `_i_word_first_unset' and `_i_word_unset_tail' respecively. These are indices into the array of integers and updated by the bit setting and clearing functions. For the first, it is guaranteed that all words before this index are completely set (0xffffffff), for the second, it is guaranteed, that all words from that index on are zero. Integers are searched from these indices either forward or backwards for values != 0xffffffff, != 0 resp. Then a bsf/bsr insn is utilized to get the first nonzero bit/the last nonzero bit. Here is cachegrind's output for the two (with accumulated counts added for your convenience): --8<---------------cut here---------------start------------->8--- Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw Bc Bcm Bi Bim . . . . . . . . . . . . . _ZNK5xmltr6memory15input_page_impl14dynamic_bitset20find_first_unset_bitEv: . . . . . . . . . . . . . .LFB866: . . . . . . . . . . . . . .cfi_startproc 640,000 1 1 640,000 0 0 0 0 0 0 0 0 0 movq 8(%rdi), %rcx 640,000 0 0 640,000 0 0 0 0 0 0 0 0 0 movq 16(%rdi), %rax 640,000 0 0 640,000 0 0 0 0 0 0 0 0 0 movq 32(%rdi), %rdx 640,000 0 0 0 0 0 0 0 0 0 0 0 0 subq %rcx, %rax 640,000 0 0 0 0 0 0 0 0 0 0 0 0 sarq $2, %rax 640,000 0 0 0 0 0 0 0 0 0 0 0 0 cmpq %rdx, %rax 640,000 0 0 0 0 0 0 0 0 640,000 6 0 0 jne .L24 100 0 0 0 0 0 0 0 0 0 0 0 0 jmp .L21 . . . . . . . . . . . . . .p2align 4,,10 . . . . . . . . . . . . . .p2align 3 . . . . . . . . . . . . . .L27: 19,900 0 0 0 0 0 0 0 0 0 0 0 0 addq $1, %rdx 19,900 0 0 0 0 0 0 0 0 0 0 0 0 cmpq %rdx, %rax 19,900 0 0 0 0 0 0 0 0 19,900 0 0 0 je .L21 . . . . . . . . . . . . . .L24: 659,800 0 0 659,800 0 0 0 0 0 0 0 0 0 movl (%rcx,%rdx,4), %esi 659,800 0 0 0 0 0 0 0 0 0 0 0 0 xorl $-1, %esi 659,800 0 0 0 0 0 0 0 0 659,800 19,900 0 0 je .L27 639,900 0 0 0 0 0 0 0 0 0 0 0 0 bsfl %esi, %esi 639,900 0 0 0 0 0 0 0 0 0 0 0 0 movl $-1, %eax 639,900 0 0 0 0 0 639,900 0 0 0 0 0 0 movq %rdx, 32(%rdi) 639,900 0 0 0 0 0 0 0 0 0 0 0 0 cmovne %esi, %eax 639,900 0 0 0 0 0 0 0 0 0 0 0 0 salq $5, %rdx 639,900 0 0 0 0 0 0 0 0 0 0 0 0 movl %eax, %eax 639,900 0 0 0 0 0 0 0 0 0 0 0 0 addq %rdx, %rax 639,900 0 0 639,900 0 0 0 0 0 0 0 0 0 ret . . . . . . . . . . . . . .p2align 4,,10 . . . . . . . . . . . . . .p2align 3 . . . . . . . . . . . . . .L21: 100 1 1 0 0 0 100 0 0 0 0 0 0 movq %rdx, 32(%rdi) 100 0 0 0 0 0 0 0 0 0 0 0 0 xorl %eax, %eax 100 0 0 0 0 0 0 0 0 0 0 0 0 salq $5, %rdx 100 0 0 0 0 0 0 0 0 0 0 0 0 addq %rdx, %rax 100 0 0 100 0 0 0 0 0 0 0 0 0 ret . . . . . . . . . . . . . .cfi_endproc ------------------------------------------------------------------------------- 11638900 2 2 3219800 0 0 640000 0 0 1319700 19906 0 0 Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw Bc Bcm Bi Bim . . . . . . . . . . . . . _ZNK5xmltr6memory15input_page_impl14dynamic_bitset15find_unset_tailEv: . . . . . . . . . . . . . .LFB867: . . . . . . . . . . . . . .cfi_startproc 640,000 0 0 640,000 0 0 0 0 0 0 0 0 0 movq 40(%rdi), %rdx 640,000 0 0 0 0 0 0 0 0 0 0 0 0 testq %rdx, %rdx 640,000 0 0 0 0 0 0 0 0 640,000 2 0 0 je .L35 640,000 0 0 640,000 0 0 0 0 0 0 0 0 0 movq 8(%rdi), %rsi 640,000 1 1 0 0 0 0 0 0 0 0 0 0 subq $1, %rdx 640,000 0 0 640,000 0 0 0 0 0 0 0 0 0 movl (%rsi,%rdx,4), %eax 640,000 0 0 0 0 0 0 0 0 0 0 0 0 testl %eax, %eax 640,000 0 0 0 0 0 0 0 0 640,000 20,000 0 0 je .L32 620,000 0 0 0 0 0 0 0 0 0 0 0 0 jmp .L41 . . . . . . . . . . . . . .p2align 4,,10 . . . . . . . . . . . . . .p2align 3 . . . . . . . . . . . . . .L34: 19,900 0 0 0 0 0 0 0 0 0 0 0 0 leaq -1(%rdx), %rcx 19,900 0 0 19,900 0 0 0 0 0 0 0 0 0 movl (%rsi,%rcx,4), %eax 19,900 0 0 0 0 0 0 0 0 0 0 0 0 testl %eax, %eax 19,900 0 0 0 0 0 0 0 0 19,900 2 0 0 jne .L40 . . . . . . . . . . . . . movq %rcx, %rdx . . . . . . . . . . . . . .L32: 20,000 0 0 0 0 0 0 0 0 0 0 0 0 testq %rdx, %rdx 20,000 0 0 0 0 0 0 0 0 20,000 102 0 0 jne .L34 100 0 0 0 0 0 100 0 0 0 0 0 0 movq $0, 40(%rdi) . . . . . . . . . . . . . .L35: 100 0 0 0 0 0 0 0 0 0 0 0 0 xorl %eax, %eax 100 0 0 100 0 0 0 0 0 0 0 0 0 ret . . . . . . . . . . . . . .p2align 4,,10 . . . . . . . . . . . . . .p2align 3 . . . . . . . . . . . . . .L40: 19,900 0 0 0 0 0 19,900 0 0 0 0 0 0 movq %rdx, 40(%rdi) . . . . . . . . . . . . . .L31: 639,900 0 0 0 0 0 0 0 0 0 0 0 0 bsrl %eax, %edx 639,900 0 0 0 0 0 0 0 0 0 0 0 0 movl $32, %eax 639,900 0 0 0 0 0 0 0 0 0 0 0 0 xorl $31, %edx 639,900 0 0 0 0 0 0 0 0 0 0 0 0 subl %edx, %eax 639,900 0 0 0 0 0 0 0 0 639,900 0 0 0 je .L35 639,900 0 0 0 0 0 0 0 0 0 0 0 0 salq $5, %rcx 639,900 0 0 0 0 0 0 0 0 0 0 0 0 movl %eax, %eax 639,900 0 0 0 0 0 0 0 0 0 0 0 0 addq %rcx, %rax 639,900 0 0 639,900 0 0 0 0 0 0 0 0 0 ret . . . . . . . . . . . . . .L41: 620,000 1 1 0 0 0 0 0 0 0 0 0 0 movq %rdx, %rcx 620,000 0 0 0 0 0 0 0 0 0 0 0 0 jmp .L31 . . . . . . . . . . . . . .cfi_endproc ------------------------------------------------------------------------------- 12878900 2 2 2579900 0 0 20000 0 0 1959800 20106 0 0 --8<---------------cut here---------------end--------------->8--- The isns (Ir) are comparable, so are the cache misses. The faster `find_first_unset_bit()' has even far more memory writes (Dw) than the slower `find_unset_tail()' and slightly more memory reads. The only noticeable difference I can see are the branch counts: `find_unset_tail()' has got about 48% more of them than `find_first_unset_bit()'. However, the instuction read cache misses and the branch mispredicts are negliglble in both case and according to AMD's "Software Optimization Guide for AMD Family 10h and 12h Processors" (which applies to my box), the Jcc as well as the JMP insns have got a latency of 1 cycle. So, this can't be the reason for the runtime difference or am I getting something wrong? Thanks alot for your time and efforts! Best, Nicolai ------------------------------------------------------------------------------ Shape the Mobile Experience: Free Subscription Software experts and developers: Be at the forefront of tech innovation. Intel(R) Software Adrenaline delivers strategic insight and game-changing conversations that shape the rapidly evolving mobile landscape. Sign up now. http://pubads.g.doubleclick.net/gampad/clk?id=63431311&iu=/4140/ostg.clktrk _______________________________________________ Valgrind-users mailing list Valgrind-users@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/valgrind-users