Robert,
The script you suggested outputs the following, with the server mostly idle and
one OS X client spinning in the Finder, doing its thing to cause the flood of
getcwd() calls.
# dtrace -n 'fbt::dnlc_reverse_lookup:entry{ self->t = timestamp; }' -n
'fbt::dnlc_reverse_lookup:return/self->t/{ @[arg1 == 0 ? 0 : 1] =
quantize(timestamp - self->t); self->t = 0; }'
dtrace: description 'fbt::dnlc_reverse_lookup:entry' matched 1 probe
dtrace: description 'fbt::dnlc_reverse_lookup:return' matched 1 probe
^C
1
value ------------- Distribution ------------- count
1048576 | 0
2097152 |@@@@@@@@ 240
4194304 | 8
8388608 |@@@@@@@@@@@@@@@@@@@@@ 636
16777216 |@@@@@@@@@@@ 348
33554432 | 0
0
value ------------- Distribution ------------- count
8388608 | 0
16777216 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 971
33554432 | 2
67108864 | 0
I don't yet understand how this relates to the results from Ralph's afp.d
Dtrace script, which shows getcwd() taking either ~ 50 us or 250 ms, with a 3:1
bimodal distribution. Does getcwd() call dnlc_reverse_lookup always, or just
sometimes?
How might the size of the dnlc affect things? I understand it can be adjusted
by putting
set ncsize=nnn
into /etc/system, but I have also seen suggestions that it can be dynamically
tuned like this:
# echo dnlc_max_nentries/W0t10485760 | mdb -kw
# echo ncsize/W0t5240000 | mdb -kw
(Is that safe or advisable? Does it work? I'm hesitant to try, but it would be
convenient not to reboot the server, not to mention the difficulty of modifying
/etc/system on SmartOS.)
Currently, ncsize is set to 444,719 (SmartOS default?), and dnlc_nentries seems
to slowly fluctuate between 200,000 and 300,000.
# echo ncsize/D | mdb -kw
ncsize: 444719 # and similarly for
dnlc_nentries: 272139
dnlc_max_nentries: 889438
dnlc_nentries_low_water: 440316
Dnlc cache hits seem to be around 50% per
# vmstat -s
5079608454 total name lookups (cache hits 50%)
and
# kstat -n dnlcstats
module: unix instance: 0
name: dnlcstats class: misc
crtime 8,600642349
dir_add_abort 0
dir_add_max 0
dir_add_no_memory 0
dir_cached_current 2
dir_cached_total 2
dir_entries_cached_current 1094
dir_fini_purge 0
dir_hits 0
dir_misses 19498
dir_reclaim_any 0
dir_reclaim_last 0
dir_remove_entry_fail 0
dir_remove_space_fail 0
dir_start_no_memory 0
dir_update_fail 0
double_enters 14
enters 1858063
hits 2546253549
misses 2531297654
negative_cache_hits 22475516
pick_free 0
pick_heuristic 995275
pick_last 3997
purge_all 0
purge_fs1 0
purge_total_entries 1786
purge_vfs 44
purge_vp 67
snaptime 3016218,120526680
Thanks for insights so far!
Best,
Chris
Am 01.08.2014 um 17:07 schrieb Robert Mustacchi via smartos-discuss
<[email protected]>:
> On 07/31/2014 11:15 PM, Chris Ferebee via smartos-discuss wrote:
>> Hi Robert,
>>
>> The flame graph is here:
>>
>> <http://fere.be/p/netatalk-flamegraph.svg>
>>
>> with corresponding output from Ralph's Dtrace script here:
>>
>> <http://fere.be/p/netatalk.afp.d.out>
>>
>> The raw kernel Dtrace output of
>>
>> dtrace -x stackframes=100 -n 'profile-997 /arg0/ { @[stack()] = count(); }
>> tick-60s { exit(0); }' -o netatalk.kern_stacks-dtrace.out
>>
>> is here:
>>
>> <http://fere.be/p/netatalk.kern_stacks-dtrace.out>
>>
>> So if I understand things up to this point, the kernel is spending a lot of
>> time in dnlc_reverse_lookup. Does that tell you anything?
>
> Hey Chris,
>
> I'm not terribly familiar with how the dnlc is designed. So what'd be
> interesting here is to correlate this with the number of hits versus
> misses in the dnlc.
>
> The questions I want to answer are something along the lines of:
>
> o Are we spending a lot of time in the dnlc just to then miss?
> o Are we spending a lot of time in the dnlc scanning it because it's
> just really large and we're not hashing across it well?
> o Is something else entirely going on?
>
> Of course, the comment of the function has the following:
>
> /*
> * Perform a reverse lookup in the DNLC. This will find the first
> occurrence of
> * the vnode. If successful, it will return the vnode of the parent,
> and the
> * name of the entry in the given buffer. If it cannot be found, or the
> buffer
> * is too small, then it will return NULL. Note that this is a highly
> * inefficient function, since the DNLC is constructed solely for forward
> * lookups.
> */
>
> Which doesn't bode well. But perhaps, what I might go and do is
> something like:
>
> dtrace -n 'fbt::dnlc_reverse_lookup:entry{ self->t = timestamp; }' -n
> 'fbt::dnlc_reverse_lookup:return/self->t/{ @[arg1 == 0 ? 0 : 1] =
> quantize(timestamp - self->t); self->t = 0; }'
>
> Robert
>
>>> Am 31.07.2014 um 17:29 schrieb Robert Mustacchi <[email protected]>:
>>>
>>>> On 07/31/2014 08:15 AM, Chris Ferebee via smartos-discuss wrote:
>>>> I'm trying to debug performance issues on a SmartOS
>>>> joyent_20140613T024634Z server that is running netatalk in a zone.
>>>>
>>>> After some investigation with the help of netatalk developer Ralph Böhme,
>>>> I find that the netatalk afpd is spending a very long time in the syscall
>>>>
>>>> getcwd()
>>>>
>>>> in some cases. Ralph provided a Dtrace script which breaks down the time
>>>> spent in various syscalls, and getcwd() seems to be the major bottleneck
>>>> here. I've attached an excerpt of the Dtrace analysis below.
>>>>
>>>> Is there any reason why getcwd() should take a long time? Disk I/O does
>>>> not appear to be an issue. For reference, I got similar results before and
>>>> after starting a zpool scrub on the zones pool, and the scrub is
>>>> proceeding at 940 MB/s.
>>>>
>>>> Further background on the issue is discussed here:
>>>>
>>>> <http://sourceforge.net/p/netatalk/mailman/message/32660961/>
>>>>
>>>> and the full output of the Dtrace script is here:
>>>>
>>>> <https://gist.github.com/ferebee/99016ccd293b3f616c77>
>>>>
>>>> All sugestions are most welcome!
>>>
>>> Hey Chris,
>>>
>>> I'd suggest gathering a kernel flame graph while you're running the
>>> test. It's not immediately obvious to me why getcwd() would be hot, but
>>> what we're seeing is that with getcwd() you have a bimodal distribution.
>>> So you have the ~100 entries that are taking about 250ms (I'm fairly
>>> certain I read the script correctly and it saves everything in us).
>>>
>>> It also appears that the kernel caches the value of the current working
>>> directory, therefore it might be the case that these longer ones are
>>> when you have that cache miss because the working directory changed.
>>>
>>> What I might suggest as a next step is making a kernel flame graph to
>>> correspond with this workload and run it concurrently with the script.
>>> That might allow us to better understand what's going on. You can
>>> generate a flame graph with https://github.com/brendangregg/flamegraph.
>>>
>>> Robert
>>
>>
>>
>> -------------------------------------------
>> smartos-discuss
>> Archives: https://www.listbox.com/member/archive/184463/=now
>> RSS Feed: https://www.listbox.com/member/archive/rss/184463/21483261-4b78dd38
>> Modify Your Subscription: https://www.listbox.com/member/?&
>> Powered by Listbox: http://www.listbox.com
>>
>
>
>
> -------------------------------------------
> smartos-discuss
> Archives: https://www.listbox.com/member/archive/184463/=now
> RSS Feed: https://www.listbox.com/member/archive/rss/184463/24804823-eebbfb1e
> Modify Your Subscription: https://www.listbox.com/member/?&
> Powered by Listbox: http://www.listbox.com
-------------------------------------------
smartos-discuss
Archives: https://www.listbox.com/member/archive/184463/=now
RSS Feed: https://www.listbox.com/member/archive/rss/184463/25769125-55cfbc00
Modify Your Subscription:
https://www.listbox.com/member/?member_id=25769125&id_secret=25769125-7688e9fb
Powered by Listbox: http://www.listbox.com