Bruce Simpson wrote:
> I found mpatrol, when used as an LD_PRELOAD, can have some problems with
> symbol retrieval. I'll see if adding it to the link line (like Google's
> cpu profiler) can overcome this issue.
mpatrol will log and do a backtrace at the call site just fine under FreeBSD/i386, which suggests it's probably an x86-64 ABI issue. The mpatrol author isn't an ABI-head; there's been some churn about it on the list, and it sounds like it affects Linux too. Everyone in free-tool land, it seems, is waiting for libunwind to cut a fully working x86-64 release. Given that the libunwind project originated at HP, it's no great surprise to learn that they focused on Itanic^WItanium first.

Re call-site heap profiling: it's largely academic just now, although having it would save us a lot of time in tracking these things down.

I should mention at this point that I'm still going with Marko's old hunch that it is allocator churn (operator new) which brings XRL I/O performance down. Anecdotal evidence seems to bear this out (looking at the hits on the allocators in the call traces). Valgrind (in callgrind mode) will give accurate call counts on 'operator new()', 'malloc()' and friends; that's what really matters. If we take a callgrind-format sample from a real box (using oprofile or pmcstat) and then cross-reference, that will quickly give us some insight into whether or not the XRL I/O paths are wedging on excessive allocations.

I would just like to have these things automated, so that when I get the Thrift TTransport code banged out, I can tell at a glance that I am not comparing apples with oranges, and so that the improvements can be quantified more quickly.

What's likely to give a performance boost in the short term is cutting over to UNIX domain stream sockets for XRL. These are likely to function like pipes. At least in FreeBSD, pipes are handled as a fifofs vnode which shims directly onto the UNIX domain stream socket I/O path; these are zero-copy inside the kernel, because the I/O is strictly local. I believe Linux has since adopted similar optimizations in its pipe and UNIX domain socket implementations. Local TCP can't offer such optimizations.
The rules say: if it's a TCP, it has to act like a TCP. Even going over loopback means taking more locks and running a full TCP state machine, so zero-copy is not as easily implemented on such paths.

cheers,
BMS

_______________________________________________
Xorp-hackers mailing list
[email protected]
http://mailman.ICSI.Berkeley.EDU/mailman/listinfo/xorp-hackers
