[jira] [Commented] (KUDU-2395) Thread spike with all threads blocked in libnss
[ https://issues.apache.org/jira/browse/KUDU-2395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16844162#comment-16844162 ]

Alexey Serbin commented on KUDU-2395:
-------------------------------------

[~tlipcon] I think adding a cache for resolved DNS entries should fix this issue: at least with cached DNS names, I don't expect the number of threads performing DNS resolution to go that high. But it would be nice to add some sort of test for that (or at least test that scenario once manually).

I'll prioritize revving https://gerrit.cloudera.org/#/c/13266/ this week. Thank you for the reminder.

> Thread spike with all threads blocked in libnss
> -----------------------------------------------
>
>                 Key: KUDU-2395
>                 URL: https://issues.apache.org/jira/browse/KUDU-2395
>             Project: Kudu
>          Issue Type: Bug
>          Components: consensus, tserver, util
>            Reporter: Todd Lipcon
>            Priority: Minor
>
> I saw the thread count on a server under a load test spike from 280 threads
> (fairly constant) to 3400 threads (briefly). I checked the diagnostics log
> and found that there are several thousand threads in a stack like:
> {code}
> 0x7facce018606 _nss_files_gethostbyname2_r
> 0x345a703645
> 0x345a6d0b3b
> 0x345a6d2d80
> 0x1c9366c kudu::(anonymous namespace)::GetAddrInfo()
> 0x1c95fbe kudu::HostPort::ResolveAddresses()
> 0xac4b78 kudu::consensus::(anonymous namespace)::CreateConsensusServiceProxyForHost()
> 0xac5058 kudu::consensus::RpcPeerProxyFactory::NewProxy()
> 0xb0b212 kudu::consensus::LeaderElection::LeaderElection()
> 0xafab80 kudu::consensus::RaftConsensus::StartElection()
> 0xafd20c kudu::consensus::RaftConsensus::ReportFailureDetectedTask()
> 0x1ccf4ed kudu::FunctionRunnable::Run()
> {code}

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
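For illustration only, here is a minimal sketch of the kind of short-TTL cache for resolved names discussed in this comment. It is not the code in the Gerrit change above: the class name `TtlDnsCache`, the `Lookup` method, and the injected `Resolver` callback are all hypothetical, and addresses are modeled as plain strings so the sketch needs only the standard library. In a real server the callback would presumably wrap the existing `getaddrinfo()`-based path.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <mutex>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical sketch of a short-TTL cache for resolved host names.
// All names here are illustrative; this is not Kudu's implementation.
class TtlDnsCache {
 public:
  using Clock = std::chrono::steady_clock;
  using Addresses = std::vector<std::string>;
  using Resolver = std::function<Addresses(const std::string&)>;

  TtlDnsCache(Resolver resolver, std::chrono::milliseconds ttl)
      : resolver_(std::move(resolver)), ttl_(ttl) {}

  // Returns the cached addresses while the entry is still fresh; otherwise
  // calls the resolver and stores the result with a new expiration time.
  Addresses Lookup(const std::string& host) {
    const Clock::time_point now = Clock::now();
    {
      std::lock_guard<std::mutex> l(lock_);
      auto it = entries_.find(host);
      if (it != entries_.end() && now < it->second.expires) {
        return it->second.addrs;
      }
    }
    // Resolve outside the lock so a slow lookup for one host does not block
    // cache hits for other hosts.
    Addresses addrs = resolver_(host);
    std::lock_guard<std::mutex> l(lock_);
    entries_[host] = Entry{addrs, Clock::now() + ttl_};
    return addrs;
  }

 private:
  struct Entry {
    Addresses addrs;
    Clock::time_point expires;
  };

  Resolver resolver_;
  std::chrono::milliseconds ttl_;
  std::mutex lock_;
  std::unordered_map<std::string, Entry> entries_;
};
```

With a cache like this, repeated proxy creation for the same peer within the TTL would not reach libnss at all. One limitation of the sketch: concurrent misses for the same host each call the resolver, so a cold cache can still produce a burst of lookups.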
[jira] [Commented] (KUDU-2395) Thread spike with all threads blocked in libnss
[ https://issues.apache.org/jira/browse/KUDU-2395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16844126#comment-16844126 ]

Todd Lipcon commented on KUDU-2395:
-----------------------------------

[~aserbin] do we expect this will be fully fixed by KUDU-2791?
[jira] [Commented] (KUDU-2395) Thread spike with all threads blocked in libnss
[ https://issues.apache.org/jira/browse/KUDU-2395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16831928#comment-16831928 ]

Todd Lipcon commented on KUDU-2395:
-----------------------------------

Worth noting that this can also occur with stacks in GetLoggedInUser like this:
{code}
0x7fd25227642b __lll_lock_wait
0x7fd252271dcb _L_lock_812
0x7fd252271c98 __GI___pthread_mutex_lock
0x7fd247cfcfc3 _nss_files_getpwuid_r
0x7fd2502fc52e __getpwuid_r
0x1b97356 kudu::GetLoggedInUser()
0x1a0f453 kudu::rpc::Proxy::Proxy()
{code}
That particular case was fixed by 52b50b7a91c61925a7bc42992fe1001e74425d4d in Kudu 1.8, though.
[jira] [Commented] (KUDU-2395) Thread spike with all threads blocked in libnss
[ https://issues.apache.org/jira/browse/KUDU-2395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16824293#comment-16824293 ]

Grant Henke commented on KUDU-2395:
-----------------------------------

I lowered the priority to Minor given that we have a well-documented workaround: [https://kudu.apache.org/docs/troubleshooting.html#slow_dns_nscd]

I raised this to "critical" a couple of days ago to track an improvement to build in a short-TTL cache so that users won't need nscd, but that should be its own Jira. I opened KUDU-2791 to track that.
[jira] [Commented] (KUDU-2395) Thread spike with all threads blocked in libnss
[ https://issues.apache.org/jira/browse/KUDU-2395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16821512#comment-16821512 ]

Grant Henke commented on KUDU-2395:
-----------------------------------

In a side conversation it was mentioned that we could alleviate this issue and the _nscd_ requirement with a built-in short-TTL cache. Without such an improvement, _nscd_ is effectively required to scale Kudu clusters reliably.

> Thread spike with all threads blocked in libnss
> -----------------------------------------------
>
>                 Key: KUDU-2395
>                 URL: https://issues.apache.org/jira/browse/KUDU-2395
>             Project: Kudu
>          Issue Type: Bug
>          Components: consensus, tserver, util
>            Reporter: Todd Lipcon
>            Priority: Major
[jira] [Commented] (KUDU-2395) Thread spike with all threads blocked in libnss
[ https://issues.apache.org/jira/browse/KUDU-2395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16438122#comment-16438122 ]

Todd Lipcon commented on KUDU-2395:
-----------------------------------

Looks like YugaByte did some work in a similar area: https://github.com/YugaByte/yugabyte-db/commit/9ba84368f3f30ddef0489a81cff6057cd3867bb9
[jira] [Commented] (KUDU-2395) Thread spike with all threads blocked in libnss
[ https://issues.apache.org/jira/browse/KUDU-2395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16421093#comment-16421093 ]

Todd Lipcon commented on KUDU-2395:
-----------------------------------

I did a bit of symbolization of the unsymbolized libc symbols in the trace:
- all of the threads inside _nss_* are stuck in __pthread_mutex_lock except for one.
- that one is in '_nss_files_gethostbyname2_r' calling 'fgets_unlocked' (I guess reading /etc/hosts)

So, essentially, DNS resolution is single-threaded inside libnss, and this causes a large bottleneck when a bunch of threads need to create proxies during a detected failure.

The thing that seems to have precipitated all of this was a short process-wide blip:
{code}
W0330 16:29:58.896669 8975 net_util.cc:159] Time spent resolve address for vc1310.halxg.cloudera.com: real 1.003s user 0.001s sys 0.000s
W0330 16:29:58.899119 4840 thread.cc:554] raft [worker] (thread pool) Time spent creating pthread: real 0.967s user 0.001s sys 0.000s
W0330 16:29:58.899160 4840 thread.cc:521] raft [worker] (thread pool) Time spent starting thread: real 0.967s user 0.001s sys 0.000s
W0330 16:29:58.899212 4839 thread.cc:554] raft [worker] (thread pool) Time spent creating pthread: real 0.983s user 0.000s sys 0.000s
W0330 16:29:58.899235 4841 thread.cc:554] raft [worker] (thread pool) Time spent creating pthread: real 0.954s user 0.001s sys 0.000s
W0330 16:29:58.899217 4842 thread.cc:554] raft [worker] (thread pool) Time spent creating pthread: real 0.973s user 0.000s sys 0.000s
W0330 16:29:58.899281 4841 thread.cc:521] raft [worker] (thread pool) Time spent starting thread: real 0.954s user 0.001s sys 0.000s
{code}

This machine is el6 and does have THP enabled, so perhaps khugepaged attacked the process and caused the 1sec blip. But then the resulting disaster was mostly self-inflicted.
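As a rough back-of-the-envelope model of the bottleneck described in this comment: if every lookup runs under one process-wide lock, lookups complete strictly one after another, so a pile of waiting threads drains linearly and keeps growing whenever new lookups arrive faster than the lock admits them. The helper names below are hypothetical, and the per-lookup cost used in the note after the code is an assumed figure, not one measured in this issue.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

using std::chrono::milliseconds;

// Worst-case time until the last of `waiters` threads finishes when every
// lookup holds the same lock for `per_lookup`. Hypothetical helper.
constexpr milliseconds SerializedDrainTime(int64_t waiters,
                                           milliseconds per_lookup) {
  return per_lookup * waiters;
}

// Threads still queued after `window` if `arrivals_per_sec` new lookups
// arrive while the lock admits one lookup per `per_lookup`.
// `per_lookup` must be positive. Hypothetical helper.
inline int64_t BacklogAfter(int64_t arrivals_per_sec, milliseconds per_lookup,
                            milliseconds window) {
  const int64_t arrived = arrivals_per_sec * window.count() / 1000;
  const int64_t served = window.count() / per_lookup.count();
  return arrived > served ? arrived - served : 0;
}
```

For example, assuming 1 ms per serialized lookup, `SerializedDrainTime(3400, milliseconds(1))` is 3400 ms: the last of 3400 blocked threads would wait about 3.4 seconds even after the original stall has passed.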
[jira] [Commented] (KUDU-2395) Thread spike with all threads blocked in libnss
[ https://issues.apache.org/jira/browse/KUDU-2395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16421089#comment-16421089 ]

Todd Lipcon commented on KUDU-2395:
-----------------------------------

It's worth noting that this server does not have nscd running. We should probably recommend running 'nscd' and consider implementing our own DNS cache of some sort. KUDU-75 is also relevant: it would make resolution asynchronous and avoid creating thousands of threads even if DNS is slow.
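To make the asynchronous idea concrete, the following is a minimal sketch of a resolver with a fixed number of worker threads: callers enqueue a lookup and get a `std::future` back, so a slow resolver lengthens the queue instead of multiplying threads. This is not the KUDU-75 design. The class name `AsyncResolver`, the `ResolveAsync` method, and the injected `ResolveFn` callback are hypothetical, and addresses are modeled as strings to keep the sketch standard-library only.

```cpp
#include <cassert>
#include <condition_variable>
#include <functional>
#include <future>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical sketch: a fixed pool of resolver threads fed by a queue.
class AsyncResolver {
 public:
  using Addresses = std::vector<std::string>;
  using ResolveFn = std::function<Addresses(const std::string&)>;

  AsyncResolver(ResolveFn fn, int num_workers) : fn_(std::move(fn)) {
    for (int i = 0; i < num_workers; ++i) {
      workers_.emplace_back([this] { WorkerLoop(); });
    }
  }

  // Drains any queued lookups, then stops and joins the workers.
  ~AsyncResolver() {
    {
      std::lock_guard<std::mutex> l(lock_);
      shutdown_ = true;
    }
    cv_.notify_all();
    for (std::thread& t : workers_) t.join();
  }

  // Enqueues a lookup and returns immediately; the caller does not block
  // unless it waits on the returned future.
  std::future<Addresses> ResolveAsync(std::string host) {
    std::packaged_task<Addresses()> task(
        [this, host = std::move(host)] { return fn_(host); });
    std::future<Addresses> result = task.get_future();
    {
      std::lock_guard<std::mutex> l(lock_);
      queue_.push(std::move(task));
    }
    cv_.notify_one();
    return result;
  }

 private:
  void WorkerLoop() {
    for (;;) {
      std::packaged_task<Addresses()> task;
      {
        std::unique_lock<std::mutex> l(lock_);
        cv_.wait(l, [this] { return shutdown_ || !queue_.empty(); });
        if (queue_.empty()) return;  // Shutting down and fully drained.
        task = std::move(queue_.front());
        queue_.pop();
      }
      task();  // Runs the lookup outside the queue lock.
    }
  }

  ResolveFn fn_;
  std::mutex lock_;
  std::condition_variable cv_;
  std::queue<std::packaged_task<Addresses()>> queue_;
  bool shutdown_ = false;
  std::vector<std::thread> workers_;
};
```

The thread count stays at `num_workers` no matter how many lookups are pending. The trade-off is that the queue itself is unbounded in this sketch; a production version would likely need a queue limit or per-request timeouts.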