[jira] [Updated] (KUDU-2947) A replica with slow WAL may grant votes even if established leader is alive and well
[ https://issues.apache.org/jira/browse/KUDU-2947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alexey Serbin updated KUDU-2947:
--------------------------------
    Code Review: https://gerrit.cloudera.org/#/c/14260/
    Affects Version/s: 0.5.0, 0.6.0, 0.7.0, 0.7.1, 0.8.0, 0.9.0, 0.9.1, 0.10.0, 1.0.0, 1.0.1, 1.1.0, 1.2.0, 1.3.0, 1.3.1, 1.4.0, 1.5.0, 1.6.0, 1.7.0, 1.7.1, 1.8.0, 1.9.0, 1.10.0

> A replica with slow WAL may grant votes even if established leader is alive and well
> ------------------------------------------------------------------------------------
>
> Key: KUDU-2947
> URL: https://issues.apache.org/jira/browse/KUDU-2947
> Project: Kudu
> Issue Type: Bug
> Components: consensus, master, tserver
> Affects Versions: 0.5.0, 0.6.0, 0.7.0, 0.7.1, 0.8.0, 0.9.0, 0.9.1, 0.10.0, 1.0.0, 1.0.1, 1.1.0, 1.2.0, 1.3.0, 1.3.1, 1.4.0, 1.5.0, 1.6.0, 1.7.0, 1.7.1, 1.8.0, 1.9.0, 1.10.0
> Reporter: Alexey Serbin
> Assignee: Alexey Serbin
> Priority: Major
>
> In some cases when WAL operations are slow, a follower replica may grant {{yes}} votes right after processing recent Raft transactions, even if the currently established leader replica is alive and well. Such vote requests might come from so-called disruptive replicas in the cluster, which may ask for votes at a higher Raft term than the current term of the established, healthy leader.
> In some cases this might lead to multiple successive election rounds even though there was no actual reason to re-elect the leader replica.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
[jira] [Updated] (KUDU-2947) A replica with slow WAL may grant votes even if established leader is alive and well
[ https://issues.apache.org/jira/browse/KUDU-2947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alexey Serbin updated KUDU-2947:
--------------------------------
    Status: In Review (was: Open)
[jira] [Created] (KUDU-2947) A replica with slow WAL may grant votes even if established leader is alive and well
Alexey Serbin created KUDU-2947:
-----------------------------------

Summary: A replica with slow WAL may grant votes even if established leader is alive and well
Key: KUDU-2947
URL: https://issues.apache.org/jira/browse/KUDU-2947
Project: Kudu
Issue Type: Bug
Components: consensus, master, tserver
Reporter: Alexey Serbin
Assignee: Alexey Serbin

In some cases when WAL operations are slow, a follower replica may grant {{yes}} votes right after processing recent Raft transactions, even if the currently established leader replica is alive and well. Such vote requests might come from so-called disruptive replicas in the cluster, which may ask for votes at a higher Raft term than the current term of the established, healthy leader.

In some cases this might lead to multiple successive election rounds even though there was no actual reason to re-elect the leader replica.
[jira] [Commented] (KUDU-2842) Data race in CatalogManager::GetTableLocations
[ https://issues.apache.org/jira/browse/KUDU-2842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16932979#comment-16932979 ]

Andrew Wong commented on KUDU-2842:
-----------------------------------

I started looking into it more, and it seems like the first approach would be somewhat difficult, since the codepaths are shared with GetTabletLocations, whose calls might span multiple tables (and thus, whose TableMetadataLock acquisitions may be repeated). I'll see about using string copies.

Alternatively, I wonder if we could have the `uuid_to_idx` map use a StringPiece into the strings in the TSInfoPBs. It seems like that would get around both drawbacks you mentioned, but I think it might be a good amount of work, given that we currently use ComputeIfAbsent to create the TSInfoPBs.

> Data race in CatalogManager::GetTableLocations
> ----------------------------------------------
>
> Key: KUDU-2842
> URL: https://issues.apache.org/jira/browse/KUDU-2842
> Project: Kudu
> Issue Type: Bug
> Components: master
> Affects Versions: 1.10.0
> Reporter: Adar Dembo
> Priority: Blocker
> Attachments: test.txt
>
> Saw this on a test run.
> I think the problem is that TSInfosDict is reused for all calls to BuildLocationsForTablet for a particular table, and the map inside it points to peer UUIDs by reference (i.e. StringPiece) instead of by copy. Thus, when a given BuildLocationsForTablet call completes, the tablet lock is released and the catalog manager is free to destroy that tablet's TabletInfo (i.e. via committing a mutation in ProcessTabletReport). But the very next call to BuildLocationsForTablet may cause the TSInfosDict map keys to be read and compared, even though the memory backing those keys no longer exists.
> Assigning to Will because he committed 586e957f7, but cc'ing [~tlipcon] as he was the original author of the code.
[jira] [Resolved] (KUDU-2946) Waiting not allowed when destructing a service pool
[ https://issues.apache.org/jira/browse/KUDU-2946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adar Dembo resolved KUDU-2946.
------------------------------
    Fix Version/s: 1.11.0
       Resolution: Fixed

Fixed in commit 49308a1.
[jira] [Commented] (KUDU-2842) Data race in CatalogManager::GetTableLocations
[ https://issues.apache.org/jira/browse/KUDU-2842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16932940#comment-16932940 ]

Adar Dembo commented on KUDU-2842:
----------------------------------

Yeah, either of those should work. The former will lead to more contention between lookups and operations that commit tablet changes (e.g. processing tablet reports), while the latter will increase the computational and memory footprint of a lookup. We can use the benchmark committed in KUDU-2711 as a guide.
[jira] [Commented] (KUDU-2842) Data race in CatalogManager::GetTableLocations
[ https://issues.apache.org/jira/browse/KUDU-2842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16932937#comment-16932937 ]

Andrew Wong commented on KUDU-2842:
-----------------------------------

A couple of thoughts, though I haven't thought about it in depth yet:
* Could we pass the TabletMetadataLock and TableMetadataLocks around with the TSInfosDict, to ensure the locks are held for the duration of the TSInfosDict's use?
* Alternatively, we could do away with StringPiece here and go with copied strings, though that would balloon the memory usage of GetTableLocations quite a bit.
[jira] [Commented] (KUDU-2946) Waiting not allowed when destructing a service pool
[ https://issues.apache.org/jira/browse/KUDU-2946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16932910#comment-16932910 ]

Adar Dembo commented on KUDU-2946:
----------------------------------

Indeed, I didn't realize that the extra-long lock acquisition in Messenger::QueueInboundCall was there to avoid a ref inc/dec. By adding the ref inc/dec, the reactor thread (which calls QueueInboundCall) may decrement the last ServicePool ref, which puts it on the hook for destroying the ServicePool, which is a blocking operation.

{noformat}
diff --git a/src/kudu/rpc/messenger.cc b/src/kudu/rpc/messenger.cc
index 41291728c..871192a2e 100644
--- a/src/kudu/rpc/messenger.cc
+++ b/src/kudu/rpc/messenger.cc
@@ -372,9 +372,7 @@ void Messenger::QueueOutboundCall(const shared_ptr ) {
 }
 
 void Messenger::QueueInboundCall(gscoped_ptr call) {
-  shared_lock guard(lock_.get_lock());
-  scoped_refptr* service = FindOrNull(rpc_services_,
-                                      call->remote_method().service_name());
+  const auto service = rpc_service(call->remote_method().service_name());
   if (PREDICT_FALSE(!service)) {
     Status s = Status::ServiceUnavailable(Substitute("service $0 not registered on $1",
                                                      call->remote_method().service_name(), name_));
{noformat}

I'll fix this.
[jira] [Assigned] (KUDU-2946) Waiting not allowed when destructing a service pool
[ https://issues.apache.org/jira/browse/KUDU-2946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adar Dembo reassigned KUDU-2946:
--------------------------------
    Assignee: Adar Dembo
[jira] [Created] (KUDU-2946) Waiting not allowed when destructing a service pool
Andrew Wong created KUDU-2946:
---------------------------------

Summary: Waiting not allowed when destructing a service pool
Key: KUDU-2946
URL: https://issues.apache.org/jira/browse/KUDU-2946
Project: Kudu
Issue Type: Bug
Components: rpc
Reporter: Andrew Wong

I have a precommit that failed in TabletServerTest.TestStatus with the following stack trace:

{code}
W0918 22:12:07.614951 30074 reactor.cc:670] Failed to create an outbound connection to 255.255.255.255:1 because connect() failed: Network error: connect(2) error: Network is unreachable (error 101)
W0918 22:12:07.615072 30137 heartbeater.cc:357] Failed 3 heartbeats in a row: no longer allowing fast heartbeat attempts.
I0918 22:12:07.614991 30138 consensus_queue.cc:206] T P 4a511a2ea982499e8681b544635aaef9 [LEADER]: Queue going to LEADER mode. State: All replicated index: 0, Majority replicated index: 74, Committed index: 74, Last appended: 74.74, Last appended by leader: 74, Current term: 75, Majority size: 1, State: 0, Mode: LEADER, active raft config: opid_index: -1 peers { permanent_uuid: "4a511a2ea982499e8681b544635aaef9" member_type: VOTER last_known_addr { host: "127.0.0.1" port: 37531 } }
I0918 22:12:07.615285 30141 maintenance_manager.cc:271] Maintenance manager is disabled. Stopping thread.
I0918 22:12:07.615489 24505 tablet_server.cc:152] TabletServer@127.0.0.1:37531 shutting down...
F0918 22:12:07.617386 30074 thread_restrictions.cc:79] Check failed: LoadTLS()->wait_allowed Waiting is not allowed to be used on this thread to prevent server-wide latency aberrations and deadlocks. Thread 30074 (name: "rpc reactor", category: "reactor")
*** Check failure stack trace: ***
*** Aborted at 1568844727 (unix time) try "date -d @1568844727" if you are using GNU date ***
PC: @ 0x7fba67e74c37 gsignal
*** SIGABRT (@0x3e85fb9) received by PID 24505 (TID 0x7fba5f109700) from PID 24505; stack trace: ***
I0918 22:12:07.626417 24505 ts_tablet_manager.cc:1159] Shutting down tablet manager...
I0918 22:12:07.626611 24505 tablet_replica.cc:273] T P 4a511a2ea982499e8681b544635aaef9: stopping tablet replica
I0918 22:12:07.626811 24505 raft_consensus.cc:2147] T P 4a511a2ea982499e8681b544635aaef9 [term 75 LEADER]: Raft consensus shutting down.
I0918 22:12:07.626994 24505 raft_consensus.cc:2174] T P 4a511a2ea982499e8681b544635aaef9 [term 75 FOLLOWER]: Raft consensus is shut down!
    @     0x7fba737ed330 (unknown) at ??:0
    @     0x7fba67e74c37 gsignal at ??:0
    @     0x7fba67e78028 abort at ??:0
    @     0x7fba6b94ae09 google::logging_fail() at ??:0
    @     0x7fba6b94c62d google::LogMessage::Fail() at ??:0
    @     0x7fba6b94e64c google::LogMessage::SendToLog() at ??:0
    @     0x7fba6b94c189 google::LogMessage::Flush() at ??:0
    @     0x7fba6b94efdf google::LogMessageFatal::~LogMessageFatal() at ??:0
    @     0x7fba6cbdb786 kudu::ThreadRestrictions::AssertWaitAllowed() at ??:0
    @           0x74059f kudu::CountDownLatch::WaitUntil() at /home/jenkins-slave/workspace/kudu-master/0/src/kudu/util/countdown_latch.h:81
    @           0x70c85a kudu::CountDownLatch::WaitFor() at /home/jenkins-slave/workspace/kudu-master/0/src/kudu/util/countdown_latch.h:94
    @     0x7fba6cb9bb28 kudu::ThreadJoiner::Join() at ??:0
    @     0x7fba6fc15cec kudu::rpc::ServicePool::Shutdown() at ??:0
    @     0x7fba6fc14604 kudu::rpc::ServicePool::~ServicePool() at ??:0
    @     0x7fba6fc14736 kudu::rpc::ServicePool::~ServicePool() at ??:0
    @     0x7fba78bd8f9f scoped_refptr<>::~scoped_refptr() at ??:0
    @     0x7fba6fb4fa56 kudu::rpc::Messenger::QueueInboundCall() at ??:0
    @     0x7fba6fb1cb5b kudu::rpc::Connection::HandleIncomingCall() at ??:0
    @     0x7fba6fb1b431 kudu::rpc::Connection::ReadHandler() at ??:0
    @     0x7fba6ac9e606 ev_invoke_pending at ??:0
    @     0x7fba6fb8593a kudu::rpc::ReactorThread::InvokePendingCb() at ??:0
    @     0x7fba6ac9f4f8 ev_run at ??:0
    @     0x7fba6fb85c89 kudu::rpc::ReactorThread::RunThread() at ??:0
    @     0x7fba6fb9d503 boost::_bi::bind_t<>::operator()() at ??:0
    @     0x7fba6fb750fc boost::function0<>::operator()() at ??:0
    @     0x7fba6cb9e3cb kudu::Thread::SuperviseThread() at ??:0
    @     0x7fba737e5184 start_thread at ??:0
    @     0x7fba67f3bffd clone at ??:0
{code}

We appear to be waiting on the destruction of the ServicePool. Might be related to https://github.com/apache/kudu/commit/0ecc2c7715505fa6d5a03f8ef967a1a96d4f55d5, which adjusted some locking in the Messenger recently.
[jira] [Resolved] (KUDU-2945) ksck_remote-test: locking a mutex that has been destroyed in the Master
[ https://issues.apache.org/jira/browse/KUDU-2945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adar Dembo resolved KUDU-2945.
------------------------------
    Fix Version/s: n/a
       Resolution: Duplicate

> ksck_remote-test: locking a mutex that has been destroyed in the Master
> -----------------------------------------------------------------------
>
> Key: KUDU-2945
> URL: https://issues.apache.org/jira/browse/KUDU-2945
> Project: Kudu
> Issue Type: Bug
> Components: master
> Reporter: Andrew Wong
> Assignee: Andrew Wong
> Priority: Major
> Fix For: n/a
> Attachments: ksck_remote-test.txt
>
> Seems like ProcessFullTabletReport is racing with GetTableLocations. I've attached the full log of the test.
[jira] [Assigned] (KUDU-2842) Data race in CatalogManager::GetTableLocations
[ https://issues.apache.org/jira/browse/KUDU-2842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adar Dembo reassigned KUDU-2842:
--------------------------------
    Assignee: (was: Will Berkeley)
[jira] [Created] (KUDU-2945) ksck_remote-test: locking a mutex that has been destroyed in the Master
Andrew Wong created KUDU-2945:
---------------------------------

Summary: ksck_remote-test: locking a mutex that has been destroyed in the Master
Key: KUDU-2945
URL: https://issues.apache.org/jira/browse/KUDU-2945
Project: Kudu
Issue Type: Bug
Components: master
Reporter: Andrew Wong
Assignee: Andrew Wong
Attachments: ksck_remote-test.txt

Seems like ProcessFullTabletReport is racing with GetTableLocations. I've attached the full log of the test.
[jira] [Assigned] (KUDU-2934) Bad merge behavior for some metrics
[ https://issues.apache.org/jira/browse/KUDU-2934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

YifanZhang reassigned KUDU-2934:
--------------------------------
    Assignee: YifanZhang

> Bad merge behavior for some metrics
> -----------------------------------
>
> Key: KUDU-2934
> URL: https://issues.apache.org/jira/browse/KUDU-2934
> Project: Kudu
> Issue Type: Bug
> Components: metrics
> Affects Versions: 1.11.0
> Reporter: Yingchun Lai
> Assignee: YifanZhang
> Priority: Minor
>
> We added a feature to merge metrics in commit fe6e5cc0c9c1573de174d1ce7838b449373ae36e ([metrics] Merge metrics by the same attribute). For AtomicGauge-type metrics we sum the merged values, which works for almost all metrics in Kudu.
> But I found a metric that cannot be merged this simply, namely "average_diskrowset_height", because it is an "average" value.