I caught up with King Lee on Slack over the weekend, but wanted to reply
here so that if anyone else in the community hits similar issues they might
check on the same things.

- I asked for a dump of /metrics and /rpcz from the TS web server.

From this I could see that a lot of the consensus requests were stuck for a
long time waiting on various locks.
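
For anyone else wanting to capture the same info: these are just pages on the
tablet server's embedded web server, so curl works. A rough sketch, assuming
the default web UI port of 8050 (the hostname is just an example taken from
King's logs):

for page in metrics rpcz ; do curl -s "http://cloud-sk-ds-08:8050/$page" > /tmp/$page.txt ; done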

- I asked for 10 'pstack' dumps, 10 seconds apart, using the following
command:

for x in $(seq 1 10) ; do pstack $(pgrep kudu-tserver) > /tmp/stacks.$x.txt ; sleep 10 ; done
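
If pstack isn't available on the box, gdb can produce equivalent output; a
rough sketch of the same loop:

for x in $(seq 1 10) ; do gdb -batch -ex "thread apply all bt" -p $(pgrep kudu-tserver) > /tmp/stacks.$x.txt ; sleep 10 ; done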

In the stacks I found some interesting stuff:

1) Many threads in the process of leader elections
- King Lee has an SSD available on his tablet server boxes, so I gave him a
script to migrate the consensus-meta/ and tablet-meta/ directories to the
SSD and replace them with symlinks on the original drive. This is basically
a manual workaround for https://issues.apache.org/jira/browse/KUDU-1489
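
Roughly, the migration looks like the following (a sketch only -- the data
and SSD paths here are illustrative, and the tablet server must be stopped
first):

# stop kudu-tserver before running this; /data/kudu and /ssd/kudu are example paths
for d in consensus-meta tablet-meta ; do
  cp -a /data/kudu/$d /ssd/kudu/$d      # copy the metadata to the SSD
  mv /data/kudu/$d /data/kudu/$d.orig   # keep the original until the copy is verified
  ln -s /ssd/kudu/$d /data/kudu/$d      # leave a symlink on the original drive
done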

2) A lot of threads blocked in DNS lookups
- He's using /etc/hosts only and not actual DNS.
- The stack looks like:

Thread 770 (Thread 0x7fa8ecd0d700 (LWP 7365)):
#0  0x00007fabc504e334 in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007fabc50495d8 in _L_lock_854 () from /lib64/libpthread.so.0
#2  0x00007fabc50494a7 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x00007fabbcfe2e07 in _nss_files_gethostbyname2_r () from /lib64/libnss_files.so.2
#4  0x00007fabc3fed915 in gethostbyname2_r@@GLIBC_2.2.5 () from /lib64/libc.so.6
#5  0x00007fabc3fbad62 in gaih_inet () from /lib64/libc.so.6
#6  0x00007fabc3fbcfbf in getaddrinfo () from /lib64/libc.so.6
#7  0x0000000001b3bd06 in kudu::HostPort::ResolveAddresses(std::vector<kudu::Sockaddr, std::allocator<kudu::Sockaddr> >*) const ()

It seems like at least this version of glibc (from el6) has a lock around
the '/etc/hosts'-based implementation of gethostbyname. It still struck me
as odd that looking up a host in /etc/hosts would be slow, but I also found
this stack:

Thread 496 (Thread 0x7fa65d7bb700 (LWP 28798)):
#0  0x00007fabc3fcf59a in mmap64 () from /lib64/libc.so.6
#1  0x00007fabc3f5015c in _IO_file_doallocate_internal () from /lib64/libc.so.6
#2  0x00007fabc3f5d61c in _IO_doallocbuf_internal () from /lib64/libc.so.6
#3  0x00007fabc3f5bbd4 in _IO_new_file_underflow () from /lib64/libc.so.6
#4  0x00007fabc3f5d65e in _IO_default_uflow_internal () from /lib64/libc.so.6
#5  0x00007fabc3f51d3a in _IO_getline_info_internal () from /lib64/libc.so.6
#6  0x00007fabc3f5ab71 in fgets_unlocked () from /lib64/libc.so.6
#7  0x00007fabbcfe1ab1 in internal_getent () from /lib64/libnss_files.so.2
#8  0x00007fabbcfe2e96 in _nss_files_gethostbyname2_r () from /lib64/libnss_files.so.2
#9  0x00007fabc3fed915 in gethostbyname2_r@@GLIBC_2.2.5 () from /lib64/libc.so.6
#10 0x00007fabc3fbad62 in gaih_inet () from /lib64/libc.so.6
#11 0x00007fabc3fbcfbf in getaddrinfo () from /lib64/libc.so.6

I dug into the source for el6's glibc and found that, indeed, it uses
mmap() to read /etc/hosts on every host resolution.
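
As a quick sanity check, anyone who suspects the same problem can time a few
resolutions of a cluster hostname (the hostname below is just an example from
King's logs):

for i in $(seq 1 5) ; do time getent hosts cloud-sk-ds-08 ; done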

This led me to question why mmap was so slow. The most common reason I've
seen for that is transparent hugepages (THP) being enabled on el6. It turns
out King's system does have THP enabled, so I asked him to disable it. See
https://issues.apache.org/jira/browse/KUDU-1447
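
For reference, disabling THP at runtime on el6 looks roughly like this. Note
that el6 kernels expose the knobs under redhat_transparent_hugepage (mainline
kernels use transparent_hugepage), and the setting needs to go somewhere like
/etc/rc.local to survive a reboot:

echo never > /sys/kernel/mm/redhat_transparent_hugepage/enabled
echo never > /sys/kernel/mm/redhat_transparent_hugepage/defrag   # some older el6 kernels expect 'no' here instead of 'never'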

- The last thing I noticed was a bunch of threads stuck in this stack:

Thread 54 (Thread 0x7fabb95e5700 (LWP 29901)):
#0  0x00007fabc504e334 in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007fabc50495d8 in _L_lock_854 () from /lib64/libpthread.so.0
#2  0x00007fabc50494a7 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x0000000001b3669f in kudu::Mutex::Acquire() ()
#4  0x0000000001b17193 in kudu::KernelStackWatchdog::Unregister() ()
#5  0x0000000001b1736a in void kudu::threadlocal::internal::Destroy<kudu::KernelStackWatchdog::TLS>(void*) ()
#6  0x0000000001b6c316 in kudu::threadlocal::internal::InvokeDestructors(void*) ()
#7  0x00007fabc5047b39 in start_thread () from /lib64/libpthread.so.0
#8  0x00007fabc3fd2bcd in clone () from /lib64/libc.so.6

I filed KUDU-2215 for this one. This particular stuck thread issue wouldn't
cause problems in Kudu 1.5, but in 1.4 and earlier the slow shutdown of a
thread could cause consensus to get "stuck" as well in a stack like this:

Thread 2137 (Thread 0x7fa4e4c1c700 (LWP 1564)):
#0  0x00007fabc504ba5e in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x0000000001ad9cdd in kudu::ConditionVariable::TimedWait(kudu::MonoDelta const&) const ()
#2  0x0000000001b63d0c in kudu::ThreadJoiner::Join() ()
#3  0x0000000001b55b4c in kudu::ResettableHeartbeater::Stop() ()
#4  0x0000000000a65b42 in kudu::consensus::Peer::Close() ()
#5  0x0000000000a819b9 in kudu::consensus::PeerManager::Close() ()
#6  0x0000000000a45dde in kudu::consensus::RaftConsensus::BecomeReplicaUnlocked() ()
#7  0x0000000000a51063 in kudu::consensus::RaftConsensus::HandleTermAdvanceUnlocked(long, kudu::consensus::ReplicaState::FlushToDisk) ()
#8  0x0000000000a5145f in kudu::consensus::RaftConsensus::NotifyTermChange(long) ()
#9  0x0000000000a6ceac in kudu::consensus::PeerMessageQueue::NotifyObserversOfTermChangeTask(long) ()

which I also found in his stacks fairly often.

To work around this I suggested disabling the kernel_stack_watchdog using a
hidden configuration. I'll look at a fix for this that we can backport to
earlier versions as well.

King -- please report back in a couple days whether the suggestions we
talked about help resolve the problems on your cluster.

Thanks
-Todd

On Thu, Nov 9, 2017 at 11:27 PM, Lee King <yuyunliu...@gmail.com> wrote:

> Hi, Todd
>     Our kudu cluster's error/warning log looks just like
> https://issues.apache.org/jira/browse/KUDU-1078, and that issue's status
> is reopened. I have uploaded logs for analyzing the issue; if you want
> more detail, just tell me 😄.
> log files:
> https://drive.google.com/open?id=1_1l2xpT3-NmumgI_sIdxch-6BocXqTCt
> https://drive.google.com/open?id=0B4-NyGFtYNboN3NYNW1pVWQwcFVLa083VkRIUTZHRk85WHY4
>
> 2017-11-04 13:15 GMT+08:00 Todd Lipcon <t...@cloudera.com>:
>
>> One thing you might try is to update the consensus rpc timeout to 30
>> seconds instead of 1. We changed the default in later versions.
>>
>> I'd also recommend updating to 1.4 or 1.5 for other related fixes to
>> consensus stability. I think I recall you were on 1.3 still?
>>
>> Todd
>>
>>
>> On Nov 3, 2017 7:47 PM, "Lee King" <yuyunliu...@gmail.com> wrote:
>>
>> Hi,
>>     Our kudu cluster has run well for a long time, but writes became slow
>> recently, and clients have also hit RPC timeouts. I checked the warnings
>> and found vast numbers of errors like this:
>> W1104 10:25:16.833736 10271 consensus_peers.cc:365] T 149ffa58ac274c9ba8385ccfdc01ea14 P 59c768eb799243678ee7fa3f83801316 -> Peer 1c67a7e7ff8f4de494469766641fccd1 (cloud-sk-ds-08:7050): Couldn't send request to peer 1c67a7e7ff8f4de494469766641fccd1 for tablet 149ffa58ac274c9ba8385ccfdc01ea14. Status: Timed out: UpdateConsensus RPC to 10.6.60.9:7050 timed out after 1.000s (SENT). Retrying in the next heartbeat period. Already tried 5 times.
>>     I changed the configuration to rpc_service_queue_length=400 and
>> rpc_num_service_threads=40, but it had no effect.
>>     Our cluster includes 5 masters and 10 tablet servers (ts), with 3800G
>> of data and 800 tablets per ts. I checked one of the ts machines: memory
>> 14G free (128G total), threads 4739 (max 32000), open files 28000 (max
>> 65536), CPU utilization about 30% (32 cores), disk util less than 30%.
>>     Any suggestions for this? Thanks!
>>
>>
>>
>


-- 
Todd Lipcon
Software Engineer, Cloudera
