[jira] [Commented] (KUDU-2395) Thread spike with all threads blocked in libnss

2019-05-20 Thread Alexey Serbin (JIRA)


[ 
https://issues.apache.org/jira/browse/KUDU-2395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16844162#comment-16844162
 ] 

Alexey Serbin commented on KUDU-2395:
-

[~tlipcon] I think adding a cache for resolved DNS entries should fix this 
issue; at least with cached DNS names, I don't expect the number of threads 
performing DNS resolution to go that high.  But it would be nice to add some 
sort of test for that (or at least to test that scenario once manually).

I'll prioritize revving https://gerrit.cloudera.org/#/c/13266/ this week.  
Thank you for the reminder.
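
For illustration only, here is a minimal sketch of the kind of short-TTL cache 
being discussed. It is not the code under review in the gerrit change above: 
the class name TtlDnsCache, the injected ResolveFn, and the use of plain 
strings for addresses are all assumptions made for the example. The slow 
lookup (something that would end up in getaddrinfo()) is passed in as a 
function so that a test can count how often it is really called.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <mutex>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical names throughout; this is not Kudu's resolver.
class TtlDnsCache {
 public:
  using Clock = std::chrono::steady_clock;
  using Addresses = std::vector<std::string>;
  // The slow lookup being wrapped, e.g. a function that calls getaddrinfo().
  using ResolveFn = std::function<Addresses(const std::string&)>;

  TtlDnsCache(ResolveFn resolve, std::chrono::milliseconds ttl)
      : resolve_(std::move(resolve)),
        ttl_(std::chrono::duration_cast<Clock::duration>(ttl)) {
    assert(resolve_ != nullptr);
  }

  // Returns the cached addresses while the entry is fresh; otherwise calls
  // the wrapped resolver and stores the result with a new expiry time.
  Addresses Lookup(const std::string& host) {
    const Clock::time_point now = Clock::now();
    {
      std::lock_guard<std::mutex> l(mu_);
      auto it = entries_.find(host);
      if (it != entries_.end() && now < it->second.expires) {
        return it->second.addrs;
      }
    }
    // Resolve outside the lock so lookups for other hosts are not blocked.
    // Concurrent misses for the same host may each call the resolver.
    Addresses addrs = resolve_(host);
    std::lock_guard<std::mutex> l(mu_);
    entries_[host] = Entry{addrs, Clock::now() + ttl_};
    return addrs;
  }

 private:
  struct Entry {
    Addresses addrs;
    Clock::time_point expires;
  };

  ResolveFn resolve_;
  Clock::duration ttl_;
  std::mutex mu_;
  std::unordered_map<std::string, Entry> entries_;
};
```

With a TTL of a few seconds, a burst of elections that all need the same few 
peer hostnames would call the wrapped resolver roughly once per host per TTL 
window rather than once per proxy. Passing a counting fake as the ResolveFn 
is one way to check that in a test.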

> Thread spike with all threads blocked in libnss
> ---
>
> Key: KUDU-2395
> URL: https://issues.apache.org/jira/browse/KUDU-2395
> Project: Kudu
> Issue Type: Bug
> Components: consensus, tserver, util
> Reporter: Todd Lipcon
> Priority: Minor
>
> I saw the thread count on a server under a load test spike from 280 threads 
> (fairly constant) to 3400 threads (briefly). I checked the diagnostics log 
> and found that there are several thousand threads in a stack like:
> {code}
> 0x7facce018606 _nss_files_gethostbyname2_r
>   0x345a703645 
>   0x345a6d0b3b 
>   0x345a6d2d80 
>  0x1c9366c kudu::(anonymous namespace)::GetAddrInfo()
>  0x1c95fbe kudu::HostPort::ResolveAddresses()
>   0xac4b78 kudu::consensus::(anonymous namespace)::CreateConsensusServiceProxyForHost()
>   0xac5058 kudu::consensus::RpcPeerProxyFactory::NewProxy()
>   0xb0b212 kudu::consensus::LeaderElection::LeaderElection()
>   0xafab80 kudu::consensus::RaftConsensus::StartElection()
>   0xafd20c kudu::consensus::RaftConsensus::ReportFailureDetectedTask()
>  0x1ccf4ed kudu::FunctionRunnable::Run()
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KUDU-2395) Thread spike with all threads blocked in libnss

2019-05-20 Thread Todd Lipcon (JIRA)


[ 
https://issues.apache.org/jira/browse/KUDU-2395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16844126#comment-16844126
 ] 

Todd Lipcon commented on KUDU-2395:
---

[~aserbin] do we expect this will be fully fixed by KUDU-2791?



[jira] [Commented] (KUDU-2395) Thread spike with all threads blocked in libnss

2019-05-02 Thread Todd Lipcon (JIRA)


[ 
https://issues.apache.org/jira/browse/KUDU-2395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16831928#comment-16831928
 ] 

Todd Lipcon commented on KUDU-2395:
---

Worth noting that this can also occur with stacks in GetLoggedInUser like this:
{code}
  0x7fd25227642b __lll_lock_wait
  0x7fd252271dcb _L_lock_812
  0x7fd252271c98 __GI___pthread_mutex_lock
  0x7fd247cfcfc3 _nss_files_getpwuid_r
  0x7fd2502fc52e __getpwuid_r
   0x1b97356 kudu::GetLoggedInUser()
   0x1a0f453 kudu::rpc::Proxy::Proxy()
{code}

That particular case was fixed by 52b50b7a91c61925a7bc42992fe1001e74425d4d in 
Kudu 1.8, though.
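
As an illustration of how this class of contention can be avoided (not 
necessarily what that commit does), the user name can be looked up once per 
process and reused, so that only the first caller enters libnss. The names 
LookUpUserName and CachedUserName below are hypothetical; getpwuid_r(), 
geteuid() and sysconf() are the standard POSIX calls.

```cpp
#include <pwd.h>
#include <sys/types.h>
#include <unistd.h>

#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper, not Kudu's GetLoggedInUser(): looks up the effective
// user's name via getpwuid_r(). Falls back to the numeric uid if the lookup
// fails, finds no entry, or the buffer is too small.
static std::string LookUpUserName() {
  const uid_t uid = geteuid();
  const long hint = sysconf(_SC_GETPW_R_SIZE_MAX);
  std::vector<char> buf(hint > 0 ? static_cast<size_t>(hint) : 16384);
  assert(!buf.empty());
  struct passwd pwd;
  struct passwd* result = nullptr;
  const int rc = getpwuid_r(uid, &pwd, buf.data(), buf.size(), &result);
  if (rc != 0 || result == nullptr) {
    return std::to_string(uid);
  }
  return result->pw_name;
}

// The function-local static is initialized exactly once, thread-safely
// (C++11 and later), so later callers never take the libnss mutex.
const std::string& CachedUserName() {
  static const std::string name = LookUpUserName();
  return name;
}
```

The trade-off is that a change to the passwd entry after startup is not 
picked up, which seems acceptable for a long-running server process.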



[jira] [Commented] (KUDU-2395) Thread spike with all threads blocked in libnss

2019-04-23 Thread Grant Henke (JIRA)


[ 
https://issues.apache.org/jira/browse/KUDU-2395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16824293#comment-16824293
 ] 

Grant Henke commented on KUDU-2395:
---

I lowered the priority to Minor given we have a well-documented workaround: 
[https://kudu.apache.org/docs/troubleshooting.html#slow_dns_nscd]

I raised this to "critical" a couple of days ago to track an improvement to 
build in a short-TTL cache so that users won't need nscd, but that should be 
its own Jira. I opened KUDU-2791 to track that. 



[jira] [Commented] (KUDU-2395) Thread spike with all threads blocked in libnss

2019-04-18 Thread Grant Henke (JIRA)


[ 
https://issues.apache.org/jira/browse/KUDU-2395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16821512#comment-16821512
 ] 

Grant Henke commented on KUDU-2395:
---

In a side conversation it was mentioned that we could alleviate this issue and 
the _nscd_ requirement with a built-in short-TTL cache. Without such an 
improvement, _nscd_ is effectively required to scale Kudu clusters reliably. 

> Thread spike with all threads blocked in libnss
> ---
>
> Key: KUDU-2395
> URL: https://issues.apache.org/jira/browse/KUDU-2395
> Project: Kudu
> Issue Type: Bug
> Components: consensus, tserver, util
> Reporter: Todd Lipcon
> Priority: Major
>
> I saw the thread count on a server under a load test spike from 280 threads 
> (fairly constant) to 3400 threads (briefly). I checked the diagnostics log 
> and found that there are several thousand threads in a stack like:
> {code}
> 0x7facce018606 _nss_files_gethostbyname2_r
>   0x345a703645 
>   0x345a6d0b3b 
>   0x345a6d2d80 
>  0x1c9366c kudu::(anonymous namespace)::GetAddrInfo()
>  0x1c95fbe kudu::HostPort::ResolveAddresses()
>   0xac4b78 kudu::consensus::(anonymous namespace)::CreateConsensusServiceProxyForHost()
>   0xac5058 kudu::consensus::RpcPeerProxyFactory::NewProxy()
>   0xb0b212 kudu::consensus::LeaderElection::LeaderElection()
>   0xafab80 kudu::consensus::RaftConsensus::StartElection()
>   0xafd20c kudu::consensus::RaftConsensus::ReportFailureDetectedTask()
>  0x1ccf4ed kudu::FunctionRunnable::Run()
> {code}





[jira] [Commented] (KUDU-2395) Thread spike with all threads blocked in libnss

2018-04-13 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/KUDU-2395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16438122#comment-16438122
 ] 

Todd Lipcon commented on KUDU-2395:
---

Looks like YugaByte did some work in a similar area: 
https://github.com/YugaByte/yugabyte-db/commit/9ba84368f3f30ddef0489a81cff6057cd3867bb9



[jira] [Commented] (KUDU-2395) Thread spike with all threads blocked in libnss

2018-03-30 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/KUDU-2395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16421093#comment-16421093
 ] 

Todd Lipcon commented on KUDU-2395:
---

I did a bit of symbolization of the unsymbolized libc symbols in the trace:

- all of the threads inside _nss_* are stuck in __pthread_mutex_lock except for 
one.
- that one is in '_nss_files_gethostbyname2_r' calling 'fgets_unlocked' (I 
guess reading /etc/hosts)

So, essentially, DNS resolution is single-threaded inside libnss and this 
causes a large bottleneck when a bunch of threads need to create proxies during 
a detected failure.
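
One way to shrink that pile-up, sketched here purely as an illustration and 
not as what Kudu does, is to coalesce concurrent lookups for the same host so 
that only one thread per hostname enters the resolver while the others wait 
for its result. SingleFlightResolver and the injected ResolveFn are 
hypothetical names; the wrapped function stands in for the getaddrinfo() 
call.

```cpp
#include <cassert>
#include <exception>
#include <functional>
#include <future>
#include <mutex>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical class: coalesces concurrent lookups for the same host.
// It does not cache; once a lookup finishes, the next call resolves again.
class SingleFlightResolver {
 public:
  using Addresses = std::vector<std::string>;
  using ResolveFn = std::function<Addresses(const std::string&)>;

  explicit SingleFlightResolver(ResolveFn resolve)
      : resolve_(std::move(resolve)) {
    assert(resolve_ != nullptr);
  }

  Addresses Resolve(const std::string& host) {
    std::promise<Addresses> promise;
    std::shared_future<Addresses> result;
    bool leader = false;
    {
      std::lock_guard<std::mutex> l(mu_);
      auto it = in_flight_.find(host);
      if (it != in_flight_.end()) {
        result = it->second;  // Another thread is already resolving this host.
      } else {
        result = promise.get_future().share();
        in_flight_.emplace(host, result);
        leader = true;
      }
    }
    if (leader) {
      // Only the leader calls the slow resolver; followers block in get().
      try {
        promise.set_value(resolve_(host));
      } catch (...) {
        promise.set_exception(std::current_exception());
      }
      std::lock_guard<std::mutex> l(mu_);
      in_flight_.erase(host);
    }
    return result.get();  // Rethrows the leader's exception, if any.
  }

 private:
  ResolveFn resolve_;
  std::mutex mu_;
  std::unordered_map<std::string, std::shared_future<Addresses>> in_flight_;
};
```

This only helps when many threads ask for the same few hostnames at once, 
which is the pattern in the election stacks above. Threads resolving distinct 
hosts would still serialize inside libnss.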

The thing that seems to have precipitated all of this was a short process-wide 
blip:
{code}
W0330 16:29:58.896669  8975 net_util.cc:159] Time spent resolve address for vc1310.halxg.cloudera.com: real 1.003s  user 0.001s sys 0.000s
W0330 16:29:58.899119  4840 thread.cc:554] raft [worker] (thread pool) Time spent creating pthread: real 0.967s user 0.001s sys 0.000s
W0330 16:29:58.899160  4840 thread.cc:521] raft [worker] (thread pool) Time spent starting thread: real 0.967s  user 0.001s sys 0.000s
W0330 16:29:58.899212  4839 thread.cc:554] raft [worker] (thread pool) Time spent creating pthread: real 0.983s user 0.000s sys 0.000s
W0330 16:29:58.899235  4841 thread.cc:554] raft [worker] (thread pool) Time spent creating pthread: real 0.954s user 0.001s sys 0.000s
W0330 16:29:58.899217  4842 thread.cc:554] raft [worker] (thread pool) Time spent creating pthread: real 0.973s user 0.000s sys 0.000s
W0330 16:29:58.899281  4841 thread.cc:521] raft [worker] (thread pool) Time spent starting thread: real 0.954s  user 0.001s sys 0.000s
{code}

This machine is el6 and does have THP enabled, so perhaps khugepaged attacked 
the process and caused the 1sec blip. But then the resulting disaster was 
mostly self-inflicted.



[jira] [Commented] (KUDU-2395) Thread spike with all threads blocked in libnss

2018-03-30 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/KUDU-2395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16421089#comment-16421089
 ] 

Todd Lipcon commented on KUDU-2395:
---

It's worth noting this server does not have nscd running. We should probably 
recommend usage of 'nscd' and consider implementing our own DNS cache of some 
sort. KUDU-75 is also relevant, which would make this asynchronous and avoid 
creating thousands of threads even if DNS is slow.
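
To make the asynchronous idea concrete, here is a rough sketch, under my own 
assumptions rather than anything specified in KUDU-75: lookups are queued to 
a small fixed set of resolver threads and the caller gets a callback, so slow 
DNS grows a queue instead of the thread count. AsyncResolver, ResolveFn and 
Callback are hypothetical names, and the wrapped function stands in for the 
blocking getaddrinfo() call.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <functional>
#include <mutex>
#include <string>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical class: runs blocking lookups on a bounded set of threads.
class AsyncResolver {
 public:
  using Addresses = std::vector<std::string>;
  using ResolveFn = std::function<Addresses(const std::string&)>;
  using Callback = std::function<void(const Addresses&)>;

  AsyncResolver(ResolveFn resolve, std::size_t num_workers)
      : resolve_(std::move(resolve)) {
    assert(num_workers > 0);
    for (std::size_t i = 0; i < num_workers; ++i) {
      workers_.emplace_back([this] { WorkerLoop(); });
    }
  }

  // Lets the workers drain the queue, then joins them.
  ~AsyncResolver() {
    {
      std::lock_guard<std::mutex> l(mu_);
      shutting_down_ = true;
    }
    cv_.notify_all();
    for (std::thread& t : workers_) t.join();
  }

  // Returns immediately; 'cb' later runs on one of the worker threads.
  void ResolveAsync(std::string host, Callback cb) {
    {
      std::lock_guard<std::mutex> l(mu_);
      queue_.emplace_back(std::move(host), std::move(cb));
    }
    cv_.notify_one();
  }

 private:
  void WorkerLoop() {
    for (;;) {
      std::pair<std::string, Callback> task;
      {
        std::unique_lock<std::mutex> l(mu_);
        cv_.wait(l, [this] { return shutting_down_ || !queue_.empty(); });
        if (queue_.empty()) return;  // Shutting down and nothing left to do.
        task = std::move(queue_.front());
        queue_.pop_front();
      }
      // The blocking lookup runs without the lock held.
      task.second(resolve_(task.first));
    }
  }

  ResolveFn resolve_;
  std::mutex mu_;
  std::condition_variable cv_;
  std::deque<std::pair<std::string, Callback>> queue_;
  bool shutting_down_ = false;
  std::vector<std::thread> workers_;  // Declared last: started after the rest.
};
```

Since the lookups serialize inside libnss anyway, a handful of workers should 
lose little throughput compared with thousands of blocked threads. A real 
implementation would also need error reporting and probably a queue limit.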
