Hi Robert, The flame graph is here:
<http://fere.be/p/netatalk-flamegraph.svg> with corresponding output from Ralph's Dtrace script here: <http://fere.be/p/netatalk.afp.d.out> The raw kernel Dtrace output of dtrace -x stackframes=100 -n 'profile-997 /arg0/ { @[stack()] = count(); } tick-60s { exit(0); }' -o netatalk.kern_stacks-dtrace.out is here: <http://fere.be/p/netatalk.kern_stacks-dtrace.out> So if I understand things up to this point, the kernel is spending a lot of time in dnlc_reverse_lookup. Does that tell you anything? Thanks, Chris > Am 31.07.2014 um 17:29 schrieb Robert Mustacchi <[email protected]>: > >> On 07/31/2014 08:15 AM, Chris Ferebee via smartos-discuss wrote: >> I'm trying to debug performance issues on a SmartOS joyent_20140613T024634Z >> server that is running netatalk in a zone. >> >> After some investigation with the help of netatalk developer Ralph Böhme, I >> find that the netatalk afpd is spending a very long time in the syscall >> >> getcwd() >> >> in some cases. Ralph provided a Dtrace script which breaks down the time >> spent in various syscalls, and getcwd() seems to be the major bottleneck >> here. I've attached an excerpt of the Dtrace analysis below. >> >> Is there any reason why getcwd() should take a long time? Disk I/O does not >> appear to be an issue. For reference, I got similar results before and after >> starting a zpool scrub on the zones pool, and the scrub is proceeding at 940 >> MB/s. >> >> Further background on the issue is discussed here: >> >> <http://sourceforge.net/p/netatalk/mailman/message/32660961/> >> >> and the full output of the Dtrace script is here: >> >> <https://gist.github.com/ferebee/99016ccd293b3f616c77> >> >> All sugestions are most welcome! > > Hey Chris, > > I'd suggest gathering a kernel flame graph while you're running the > test. It's not immediately obvious to me why getcwd() would be hot, but > what we're seeing is that with getcwd() you have a bimodal distribution. > So you have the ~100 entries that are taking about 250ms (I'm fairly > certain I read the script correctly and it saves everything in us). > > It also appears that the kernel caches the value of the current working > directory, therefore it might be the case that these longer ones are > when you have that cache miss because the working directory changed. > > What I might suggest as a next step is making a kernel flame graph to > correspond with this workload and run it concurrently with the script. > That might allow us to better understand what's going on. You can > generate a flame graph with https://github.com/brendangregg/flamegraph. > > Robert ------------------------------------------- smartos-discuss Archives: https://www.listbox.com/member/archive/184463/=now RSS Feed: https://www.listbox.com/member/archive/rss/184463/25769125-55cfbc00 Modify Your Subscription: https://www.listbox.com/member/?member_id=25769125&id_secret=25769125-7688e9fb Powered by Listbox: http://www.listbox.com
