[jira] [Updated] (KUDU-2727) Contention on the Raft consensus lock can cause tablet service queue overflows
[ https://issues.apache.org/jira/browse/KUDU-2727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alexey Serbin updated KUDU-2727:
    Fix Version/s: 1.13.0
       Resolution: Fixed
           Status: Resolved  (was: In Review)

> Contention on the Raft consensus lock can cause tablet service queue overflows
> ------------------------------------------------------------------------------
>
>                 Key: KUDU-2727
>                 URL: https://issues.apache.org/jira/browse/KUDU-2727
>             Project: Kudu
>          Issue Type: Improvement
>          Components: consensus, tserver
>            Reporter: William Berkeley
>            Assignee: Alexey Serbin
>            Priority: Major
>              Labels: performance, scalability
>             Fix For: 1.13.0
>
>
> Here are stacks illustrating the phenomenon:
> {noformat}
> tids=[2201]
>     0x379ba0f710
>     0x1fb951a base::internal::SpinLockDelay()
>     0x1fb93b7 base::SpinLock::SlowLock()
>     0xb4e68e kudu::consensus::Peer::SignalRequest()
>     0xb9c0df kudu::consensus::PeerManager::SignalRequest()
>     0xb8c178 kudu::consensus::RaftConsensus::Replicate()
>     0xaab816 kudu::tablet::TransactionDriver::Prepare()
>     0xaac0ed kudu::tablet::TransactionDriver::PrepareTask()
>     0x1fa37ed kudu::ThreadPool::DispatchThread()
>     0x1f9c2a1 kudu::Thread::SuperviseThread()
>     0x379ba079d1 start_thread
>     0x379b6e88fd clone
> tids=[4515]
>     0x379ba0f710
>     0x1fb951a base::internal::SpinLockDelay()
>     0x1fb93b7 base::SpinLock::SlowLock()
>     0xb74c60 kudu::consensus::RaftConsensus::NotifyCommitIndex()
>     0xb59307 kudu::consensus::PeerMessageQueue::NotifyObserversTask()
>     0xb54058 _ZN4kudu8internal7InvokerILi2ENS0_9BindStateINS0_15RunnableAdapterIMNS_9consensus16PeerMessageQueueEFvRKSt8functionIFvPNS4_24PeerMessageQueueObserverEEFvPS5_SC_EFvNS0_17UnretainedWrapperIS5_EEZNS5_34NotifyObserversOfCommitIndexChangeElEUlS8_E_EEESH_E3RunEPNS0_13BindStateBaseE
>     0x1fa37ed kudu::ThreadPool::DispatchThread()
>     0x1f9c2a1 kudu::Thread::SuperviseThread()
>     0x379ba079d1 start_thread
>     0x379b6e88fd clone
> tids=[22185,22194,22193,22188,22187,22186]
>     0x379ba0f710
>     0x1fb951a base::internal::SpinLockDelay()
>     0x1fb93b7 base::SpinLock::SlowLock()
>     0xb8bff8 kudu::consensus::RaftConsensus::CheckLeadershipAndBindTerm()
>     0xaaaef9 kudu::tablet::TransactionDriver::ExecuteAsync()
>     0xaa3742 kudu::tablet::TabletReplica::SubmitWrite()
>     0x92812d kudu::tserver::TabletServiceImpl::Write()
>     0x1e28f3c kudu::rpc::GeneratedServiceIf::Handle()
>     0x1e2986a kudu::rpc::ServicePool::RunThread()
>     0x1f9c2a1 kudu::Thread::SuperviseThread()
>     0x379ba079d1 start_thread
>     0x379b6e88fd clone
> tids=[22192,22191]
>     0x379ba0f710
>     0x1fb951a base::internal::SpinLockDelay()
>     0x1fb93b7 base::SpinLock::SlowLock()
>     0x1e13dec kudu::rpc::ResultTracker::TrackRpc()
>     0x1e28ef5 kudu::rpc::GeneratedServiceIf::Handle()
>     0x1e2986a kudu::rpc::ServicePool::RunThread()
>     0x1f9c2a1 kudu::Thread::SuperviseThread()
>     0x379ba079d1 start_thread
>     0x379b6e88fd clone
> tids=[4426]
>     0x379ba0f710
>     0x206d3d0
>     0x212fd25 google::protobuf::Message::SpaceUsedLong()
>     0x211dee4 google::protobuf::internal::GeneratedMessageReflection::SpaceUsedLong()
>     0xb6658e kudu::consensus::LogCache::AppendOperations()
>     0xb5c539 kudu::consensus::PeerMessageQueue::AppendOperations()
>     0xb5c7c7 kudu::consensus::PeerMessageQueue::AppendOperation()
>     0xb7c675 kudu::consensus::RaftConsensus::AppendNewRoundToQueueUnlocked()
>     0xb8c147 kudu::consensus::RaftConsensus::Replicate()
>     0xaab816 kudu::tablet::TransactionDriver::Prepare()
>     0xaac0ed kudu::tablet::TransactionDriver::PrepareTask()
>     0x1fa37ed kudu::ThreadPool::DispatchThread()
>     0x1f9c2a1 kudu::Thread::SuperviseThread()
>     0x379ba079d1 start_thread
>     0x379b6e88fd clone
> {noformat}
> {{kudu::consensus::RaftConsensus::CheckLeadershipAndBindTerm()}} needs to
> take the lock to check the term and the Raft role. When many RPCs come in for
> the same tablet, the contention can hog service threads and cause queue
> overflows on busy systems.
> Yugabyte switched their equivalent lock to be an atomic that allows them to
> read the term and role wait-free.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
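The last paragraph of the description mentions replacing the lock with an atomic so that the term and role can be read wait-free. A minimal sketch of that idea follows, assuming the term fits in 56 bits and the role in 8 bits so that both pack into one 64-bit word. The names `Role`, `TermAndRole`, and `Store` are hypothetical and are not taken from Kudu or Yugabyte; the method name `CheckLeadershipAndBindTerm` is reused only to show where such a read would sit. Whether the load is truly wait-free depends on the platform providing a lock-free 64-bit atomic, which is typical on mainstream 64-bit targets but is an assumption here.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical role enum for illustration; not Kudu's actual role type.
enum class Role : uint8_t { kFollower = 0, kLeader = 1, kLearner = 2 };

// Hypothetical holder that packs (term, role) into a single 64-bit atomic.
// Layout assumption: the upper 56 bits hold the term, the low 8 bits the role.
class TermAndRole {
 public:
  // Publishes a new (term, role) pair in one atomic store.
  // Assumption: writers are already serialized, e.g. by the consensus lock.
  void Store(uint64_t term, Role role) {
    assert(term < (1ULL << 56));  // the term must fit in 56 bits
    const uint64_t packed = (term << 8) | static_cast<uint64_t>(role);
    packed_.store(packed, std::memory_order_release);
  }

  // Reads a consistent (term, role) snapshot with a single atomic load,
  // so readers never spin on a lock held by another thread.
  void Load(uint64_t* term, Role* role) const {
    const uint64_t packed = packed_.load(std::memory_order_acquire);
    *term = packed >> 8;
    *role = static_cast<Role>(packed & 0xff);
  }

  // Lock-free analogue of the check described in the issue: returns true
  // and binds the current term only if this replica is the leader.
  bool CheckLeadershipAndBindTerm(uint64_t* bound_term) const {
    uint64_t term;
    Role role;
    Load(&term, &role);
    if (role != Role::kLeader) {
      return false;
    }
    *bound_term = term;
    return true;
  }

 private:
  std::atomic<uint64_t> packed_{0};  // term 0, role kFollower
};
```

Because both fields travel in one word, a reader can never observe the term from one update paired with the role from another, which is the property the lock provided for this particular check.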
[jira] [Updated] (KUDU-2727) Contention on the Raft consensus lock can cause tablet service queue overflows
[ https://issues.apache.org/jira/browse/KUDU-2727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alexey Serbin updated KUDU-2727:
    Component/s:     (was: perf)
                 tserver
                 consensus

> Contention on the Raft consensus lock can cause tablet service queue overflows
> ------------------------------------------------------------------------------
>
>                 Key: KUDU-2727
>                 URL: https://issues.apache.org/jira/browse/KUDU-2727
>             Project: Kudu
>          Issue Type: Improvement
>          Components: consensus, tserver
>            Reporter: William Berkeley
>            Assignee: Alexey Serbin
>            Priority: Major
>              Labels: performance, scalability
[jira] [Updated] (KUDU-2727) Contention on the Raft consensus lock can cause tablet service queue overflows
[ https://issues.apache.org/jira/browse/KUDU-2727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alexey Serbin updated KUDU-2727:
    Status: In Review  (was: In Progress)

> Contention on the Raft consensus lock can cause tablet service queue overflows
> ------------------------------------------------------------------------------
>
>                 Key: KUDU-2727
>                 URL: https://issues.apache.org/jira/browse/KUDU-2727
>             Project: Kudu
>          Issue Type: Improvement
>          Components: perf
>            Reporter: William Berkeley
>            Assignee: Alexey Serbin
>            Priority: Major
>              Labels: performance, scalability
[jira] [Updated] (KUDU-2727) Contention on the Raft consensus lock can cause tablet service queue overflows
[ https://issues.apache.org/jira/browse/KUDU-2727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alexey Serbin updated KUDU-2727:
    Code Review: https://gerrit.cloudera.org/#/c/16034/

> Contention on the Raft consensus lock can cause tablet service queue overflows
> ------------------------------------------------------------------------------
>
>                 Key: KUDU-2727
>                 URL: https://issues.apache.org/jira/browse/KUDU-2727
>             Project: Kudu
>          Issue Type: Improvement
>          Components: perf
>            Reporter: William Berkeley
>            Assignee: Alexey Serbin
>            Priority: Major
>              Labels: performance, scalability
[jira] [Updated] (KUDU-2727) Contention on the Raft consensus lock can cause tablet service queue overflows
[ https://issues.apache.org/jira/browse/KUDU-2727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alexey Serbin updated KUDU-2727:
    Labels: performance scalability  (was: )

> Contention on the Raft consensus lock can cause tablet service queue overflows
> ------------------------------------------------------------------------------
>
>                 Key: KUDU-2727
>                 URL: https://issues.apache.org/jira/browse/KUDU-2727
>             Project: Kudu
>          Issue Type: Improvement
>          Components: perf
>            Reporter: William Berkeley
>            Priority: Major
>              Labels: performance, scalability
[jira] [Updated] (KUDU-2727) Contention on the Raft consensus lock can cause tablet service queue overflows
[ https://issues.apache.org/jira/browse/KUDU-2727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Grant Henke updated KUDU-2727:
    Component/s: perf

> Contention on the Raft consensus lock can cause tablet service queue overflows
> ------------------------------------------------------------------------------
>
>                 Key: KUDU-2727
>                 URL: https://issues.apache.org/jira/browse/KUDU-2727
>             Project: Kudu
>          Issue Type: Improvement
>          Components: perf
>            Reporter: William Berkeley
>            Assignee: Mike Percy
>            Priority: Major