[jira] [Updated] (KUDU-2947) A replica with slow WAL may grant votes even if established leader is alive and well

2019-09-18 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-2947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin updated KUDU-2947:

  Code Review: https://gerrit.cloudera.org/#/c/14260/
Affects Version/s: 0.5.0
   0.6.0
   0.7.0
   0.7.1
   0.8.0
   0.9.0
   0.9.1
   0.10.0
   1.0.0
   1.0.1
   1.1.0
   1.2.0
   1.3.0
   1.3.1
   1.4.0
   1.5.0
   1.6.0
   1.7.0
   1.8.0
   1.7.1
   1.9.0
   1.10.0

> A replica with slow WAL may grant votes even if established leader is alive 
> and well
> 
>
> Key: KUDU-2947
> URL: https://issues.apache.org/jira/browse/KUDU-2947
> Project: Kudu
>  Issue Type: Bug
>  Components: consensus, master, tserver
>Affects Versions: 0.5.0, 0.6.0, 0.7.0, 0.7.1, 0.8.0, 0.9.0, 0.9.1, 0.10.0, 
> 1.0.0, 1.0.1, 1.1.0, 1.2.0, 1.3.0, 1.3.1, 1.4.0, 1.5.0, 1.6.0, 1.7.0, 1.8.0, 
> 1.7.1, 1.9.0, 1.10.0
>Reporter: Alexey Serbin
>Assignee: Alexey Serbin
>Priority: Major
>
> In some cases when WAL operations are slow, a follower replica may grant 
> {{yes}} votes right after processing recent Raft transactions even if the 
> currently established leader replica is alive and well.  Vote requests might 
> come from so-called disruptive replicas in the cluster.  The disruptive 
> replicas might ask for votes with a higher Raft term than the current term of 
> the established and healthy leader.
> In some cases that might lead to multiple successive election rounds even if 
> there was no actual reason to re-elect leader replicas.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KUDU-2947) A replica with slow WAL may grant votes even if established leader is alive and well

2019-09-18 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-2947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin updated KUDU-2947:

Status: In Review  (was: Open)

> A replica with slow WAL may grant votes even if established leader is alive 
> and well
> 
>
> Key: KUDU-2947
> URL: https://issues.apache.org/jira/browse/KUDU-2947
> Project: Kudu
>  Issue Type: Bug
>  Components: consensus, master, tserver
>Reporter: Alexey Serbin
>Assignee: Alexey Serbin
>Priority: Major
>
> In some cases when WAL operations are slow, a follower replica may grant 
> {{yes}} votes right after processing recent Raft transactions even if the 
> currently established leader replica is alive and well.  Vote requests might 
> come from so-called disruptive replicas in the cluster.  The disruptive 
> replicas might ask for votes with a higher Raft term than the current term of 
> the established and healthy leader.
> In some cases that might lead to multiple successive election rounds even if 
> there was no actual reason to re-elect leader replicas.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (KUDU-2947) A replica with slow WAL may grant votes even if established leader is alive and well

2019-09-18 Thread Alexey Serbin (Jira)
Alexey Serbin created KUDU-2947:
---

 Summary: A replica with slow WAL may grant votes even if 
established leader is alive and well
 Key: KUDU-2947
 URL: https://issues.apache.org/jira/browse/KUDU-2947
 Project: Kudu
  Issue Type: Bug
  Components: consensus, master, tserver
Reporter: Alexey Serbin
Assignee: Alexey Serbin


In some cases when WAL operations are slow, a follower replica may grant 
{{yes}} votes right after processing recent Raft transactions even if the 
currently established leader replica is alive and well.  Vote requests might 
come from so-called disruptive replicas in the cluster.  The disruptive 
replicas might ask for votes with a higher Raft term than the current term of 
the established and healthy leader.

In some cases that might lead to multiple successive election rounds even if 
there was no actual reason to re-elect leader replicas.
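For illustration only: one common Raft-style mitigation is for a follower to 
withhold its vote while it is still hearing from a healthy leader within the 
failure-detection window. The sketch below shows that idea in C++; all names 
(VoteRequest, last_leader_contact_, MinimumElectionTimeout) are hypothetical 
and are not Kudu's actual interfaces.

{code}
// Hypothetical sketch: withhold votes while the established leader is healthy.
// All names below are illustrative, not Kudu's real interfaces.
#include <chrono>
#include <cstdint>

struct VoteRequest {
  int64_t candidate_term;
  bool ignore_live_leader;  // e.g. set for forced/unsafe elections
};

class FollowerState {
 public:
  bool ShouldGrantVote(const VoteRequest& req) const {
    const auto now = std::chrono::steady_clock::now();
    // If we heard from a healthy leader within the failure-detection window,
    // reject the vote even if the candidate advertises a higher term. This
    // keeps a replica with a slow WAL (or any "disruptive" replica) from
    // repeatedly bumping the term and forcing needless elections.
    if (!req.ignore_live_leader &&
        now - last_leader_contact_ < MinimumElectionTimeout()) {
      return false;
    }
    return req.candidate_term > current_term_;
  }

 private:
  static std::chrono::milliseconds MinimumElectionTimeout() {
    return std::chrono::milliseconds(1500);  // illustrative value
  }

  std::chrono::steady_clock::time_point last_leader_contact_{};
  int64_t current_term_ = 0;
};
{code}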



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KUDU-2842) Data race in CatalogManager::GetTableLocations

2019-09-18 Thread Andrew Wong (Jira)


[ 
https://issues.apache.org/jira/browse/KUDU-2842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16932979#comment-16932979
 ] 

Andrew Wong commented on KUDU-2842:
---

I started looking into it more, and it seems like the first approach would be 
somewhat difficult to do, since the codepaths are shared with 
GetTabletsLocations, whose call might span multiple tables (and thus, whose 
TableMetadataLock acquisition may be repeated).

I'll see about using string copies.

Alternatively, I wonder if we could have the `uuid_to_idx` map use a 
StringPiece from the strings in TSInfoPBs. Seems like it'd get around both 
drawbacks you mentioned, but I think it might be a good amount of work, given 
we currently use ComputeIfAbsent to create the TSInfoPBs.
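As a rough illustration of the lifetime hazard under discussion (hypothetical 
types only, not the catalog manager's actual code): a map keyed by a borrowed 
view such as StringPiece stays valid only while the backing strings live, 
whereas copying the keys removes that dependency at the cost of extra 
allocations.

{code}
// Illustrative sketch only; types and names are hypothetical stand-ins for
// the TSInfosDict discussion above.
#include <map>
#include <string>
#include <string_view>

// Borrowed view, like StringPiece: cheap, but the caller must guarantee the
// underlying string outlives every lookup against the map.
using BorrowedKeyMap = std::map<std::string_view, int>;

// Owned keys: each insert copies the UUID, so the map stays valid even after
// the source (e.g. a TabletInfo released with its lock) is destroyed.
using OwnedKeyMap = std::map<std::string, int>;

int LookupOrAssign(OwnedKeyMap& uuid_to_idx, const std::string& peer_uuid) {
  // Copying the key trades memory for safety: no dangling reads when the
  // original string's owner goes away between calls.
  auto it = uuid_to_idx
                .try_emplace(peer_uuid, static_cast<int>(uuid_to_idx.size()))
                .first;
  return it->second;
}
{code}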

> Data race in CatalogManager::GetTableLocations
> --
>
> Key: KUDU-2842
> URL: https://issues.apache.org/jira/browse/KUDU-2842
> Project: Kudu
>  Issue Type: Bug
>  Components: master
>Affects Versions: 1.10.0
>Reporter: Adar Dembo
>Priority: Blocker
> Attachments: test.txt
>
>
> Saw this on a test run.
> I think the problem is that TSInfosDict is reused for all calls to 
> BuildLocationsForTablet for a particular table, and the map inside it points 
> to peer UUIDs by reference (i.e. StringPiece) instead of by copy. Thus, when 
> a given BuildLocationsForTablet call completes, the tablet lock is released 
> and the catalog manager is free to destroy that tablet's TabletInfo (i.e. via 
> committing a mutation in ProcessTabletReport). But the very next call to 
> BuildLocationsForTablet may cause TSInfosDict map keys to be read and 
> compared, even though the memory backing those keys no longer exists.
> Assigning to Will because he committed 586e957f7 but cc'ing [~tlipcon] as he 
> was the original author of the code.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (KUDU-2946) Waiting not allowed when destructing a service pool

2019-09-18 Thread Adar Dembo (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-2946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adar Dembo resolved KUDU-2946.
--
Fix Version/s: 1.11.0
   Resolution: Fixed

Fixed in commit 49308a1.

> Waiting not allowed when destructing a service pool
> ---
>
> Key: KUDU-2946
> URL: https://issues.apache.org/jira/browse/KUDU-2946
> Project: Kudu
>  Issue Type: Bug
>  Components: rpc
>Reporter: Andrew Wong
>Assignee: Adar Dembo
>Priority: Major
> Fix For: 1.11.0
>
>
> I have a precommit that failed in TabletServerTest.TestStatus with the 
> following stack trace:
> {code}
> W0918 22:12:07.614951 30074 reactor.cc:670] Failed to create an outbound 
> connection to 255.255.255.255:1 because connect() failed: Network error: 
> connect(2) error: Network is unreachable (error 101)
> W0918 22:12:07.615072 30137 heartbeater.cc:357] Failed 3 heartbeats in a row: 
> no longer allowing fast heartbeat attempts.
> I0918 22:12:07.614991 30138 consensus_queue.cc:206] T 
>  P 4a511a2ea982499e8681b544635aaef9 [LEADER]: 
> Queue going to LEADER mode. State: All replicated index: 0, Majority 
> replicated index: 74, Committed index: 74, Last appended: 74.74, Last 
> appended by leader: 74, Current term: 75, Majority size: 1, State: 0, Mode: 
> LEADER, active raft config: opid_index: -1 peers { permanent_uuid: 
> "4a511a2ea982499e8681b544635aaef9" member_type: VOTER last_known_addr { host: 
> "127.0.0.1" port: 37531 } }
> I0918 22:12:07.615285 30141 maintenance_manager.cc:271] Maintenance manager 
> is disabled. Stopping thread.
> I0918 22:12:07.615489 24505 tablet_server.cc:152] 
> TabletServer@127.0.0.1:37531 shutting down...
> F0918 22:12:07.617386 30074 thread_restrictions.cc:79] Check failed: 
> LoadTLS()->wait_allowed Waiting is not allowed to be used on this thread to 
> prevent server-wide latency aberrations and deadlocks. Thread 30074 (name: 
> "rpc reactor", category: "reactor")
> *** Check failure stack trace: ***
> *** Aborted at 1568844727 (unix time) try "date -d @1568844727" if you are 
> using GNU date ***
> PC: @ 0x7fba67e74c37 gsignal
> *** SIGABRT (@0x3e85fb9) received by PID 24505 (TID 0x7fba5f109700) from 
> PID 24505; stack trace: ***
> I0918 22:12:07.626417 24505 ts_tablet_manager.cc:1159] Shutting down tablet 
> manager...
> I0918 22:12:07.626611 24505 tablet_replica.cc:273] T 
>  P 4a511a2ea982499e8681b544635aaef9: stopping 
> tablet replica
> I0918 22:12:07.626811 24505 raft_consensus.cc:2147] T 
>  P 4a511a2ea982499e8681b544635aaef9 [term 75 
> LEADER]: Raft consensus shutting down.
> I0918 22:12:07.626994 24505 raft_consensus.cc:2174] T 
>  P 4a511a2ea982499e8681b544635aaef9 [term 75 
> FOLLOWER]: Raft consensus is shut down!
> @ 0x7fba737ed330 (unknown) at ??:0
> @ 0x7fba67e74c37 gsignal at ??:0
> @ 0x7fba67e78028 abort at ??:0
> @ 0x7fba6b94ae09 google::logging_fail() at ??:0
> @ 0x7fba6b94c62d google::LogMessage::Fail() at ??:0
> @ 0x7fba6b94e64c google::LogMessage::SendToLog() at ??:0
> @ 0x7fba6b94c189 google::LogMessage::Flush() at ??:0
> @ 0x7fba6b94efdf google::LogMessageFatal::~LogMessageFatal() at ??:0
> @ 0x7fba6cbdb786 kudu::ThreadRestrictions::AssertWaitAllowed() at ??:0
> @   0x74059f kudu::CountDownLatch::WaitUntil() at 
> /home/jenkins-slave/workspace/kudu-master/0/src/kudu/util/countdown_latch.h:81
> @   0x70c85a kudu::CountDownLatch::WaitFor() at 
> /home/jenkins-slave/workspace/kudu-master/0/src/kudu/util/countdown_latch.h:94
> @ 0x7fba6cb9bb28 kudu::ThreadJoiner::Join() at ??:0
> @ 0x7fba6fc15cec kudu::rpc::ServicePool::Shutdown() at ??:0
> @ 0x7fba6fc14604 kudu::rpc::ServicePool::~ServicePool() at ??:0
> @ 0x7fba6fc14736 kudu::rpc::ServicePool::~ServicePool() at ??:0
> @ 0x7fba78bd8f9f scoped_refptr<>::~scoped_refptr() at ??:0
> @ 0x7fba6fb4fa56 kudu::rpc::Messenger::QueueInboundCall() at ??:0
> @ 0x7fba6fb1cb5b kudu::rpc::Connection::HandleIncomingCall() at ??:0
> @ 0x7fba6fb1b431 kudu::rpc::Connection::ReadHandler() at ??:0
> @ 0x7fba6ac9e606 ev_invoke_pending at ??:0
> @ 0x7fba6fb8593a kudu::rpc::ReactorThread::InvokePendingCb() at ??:0
> @ 0x7fba6ac9f4f8 ev_run at ??:0
> @ 0x7fba6fb85c89 kudu::rpc::ReactorThread::RunThread() at ??:0
> @ 0x7fba6fb9d503 boost::_bi::bind_t<>::operator()() at ??:0
> @ 0x7fba6fb750fc boost::function0<>::operator()() at ??:0
> @ 0x7fba6cb9e3cb kudu::Thread::SuperviseThread() at ??:0
> @ 0x7fba737e5184 start_thread at ??:0
> @ 0x7fba67f3bffd clone at ??:0
> {code}

[jira] [Commented] (KUDU-2842) Data race in CatalogManager::GetTableLocations

2019-09-18 Thread Adar Dembo (Jira)


[ 
https://issues.apache.org/jira/browse/KUDU-2842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16932940#comment-16932940
 ] 

Adar Dembo commented on KUDU-2842:
--

Yeah either of those should work. The former will lead to more contention 
between lookups and operations that commit tablet changes (e.g. processing 
tablet reports), while the latter will increase the computational and memory 
footprint of a lookup. We can use the benchmark committed in KUDU-2711 as a 
guide.

> Data race in CatalogManager::GetTableLocations
> --
>
> Key: KUDU-2842
> URL: https://issues.apache.org/jira/browse/KUDU-2842
> Project: Kudu
>  Issue Type: Bug
>  Components: master
>Affects Versions: 1.10.0
>Reporter: Adar Dembo
>Priority: Blocker
> Attachments: test.txt
>
>
> Saw this on a test run.
> I think the problem is that TSInfosDict is reused for all calls to 
> BuildLocationsForTablet for a particular table, and the map inside it points 
> to peer UUIDs by reference (i.e. StringPiece) instead of by copy. Thus, when 
> a given BuildLocationsForTablet call completes, the tablet lock is released 
> and the catalog manager is free to destroy that tablet's TabletInfo (i.e. via 
> committing a mutation in ProcessTabletReport). But the very next call to 
> BuildLocationsForTablet may cause TSInfosDict map keys to be read and 
> compared, even though the memory backing those keys no longer exists.
> Assigning to Will because he committed 586e957f7 but cc'ing [~tlipcon] as he 
> was the original author of the code.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KUDU-2842) Data race in CatalogManager::GetTableLocations

2019-09-18 Thread Andrew Wong (Jira)


[ 
https://issues.apache.org/jira/browse/KUDU-2842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16932937#comment-16932937
 ] 

Andrew Wong commented on KUDU-2842:
---

A couple of thoughts, though I haven't thought about it in depth yet.

* Could we pass the TabletMetadataLock and TableMetadataLocks around with the 
TSInfosDict to ensure the locks are held for the duration of the TSInfosDict's 
usage?
* Alternatively, we could do away with StringPiece here and go with copied 
strings, though we'd be ballooning the memory usage of GetTableLocations quite 
a bit.

> Data race in CatalogManager::GetTableLocations
> --
>
> Key: KUDU-2842
> URL: https://issues.apache.org/jira/browse/KUDU-2842
> Project: Kudu
>  Issue Type: Bug
>  Components: master
>Affects Versions: 1.10.0
>Reporter: Adar Dembo
>Priority: Blocker
> Attachments: test.txt
>
>
> Saw this on a test run.
> I think the problem is that TSInfosDict is reused for all calls to 
> BuildLocationsForTablet for a particular table, and the map inside it points 
> to peer UUIDs by reference (i.e. StringPiece) instead of by copy. Thus, when 
> a given BuildLocationsForTablet call completes, the tablet lock is released 
> and the catalog manager is free to destroy that tablet's TabletInfo (i.e. via 
> committing a mutation in ProcessTabletReport). But the very next call to 
> BuildLocationsForTablet may cause TSInfosDict map keys to be read and 
> compared, even though the memory backing those keys no longer exists.
> Assigning to Will because he committed 586e957f7 but cc'ing [~tlipcon] as he 
> was the original author of the code.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KUDU-2946) Waiting not allowed when destructing a service pool

2019-09-18 Thread Adar Dembo (Jira)


[ 
https://issues.apache.org/jira/browse/KUDU-2946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16932910#comment-16932910
 ] 

Adar Dembo commented on KUDU-2946:
--

Indeed, I didn't realize that the extra-long lock acquisition in 
Messenger::QueueInboundCall was there to avoid a ref inc/dec. By adding the ref 
inc/dec, the reactor thread (which calls QueueInboundCall) may decrement the 
last ServicePool ref, which puts it on the hook for destroying the ServicePool, 
a blocking operation.

{noformat}
diff --git a/src/kudu/rpc/messenger.cc b/src/kudu/rpc/messenger.cc
index 41291728c..871192a2e 100644
--- a/src/kudu/rpc/messenger.cc
+++ b/src/kudu/rpc/messenger.cc
@@ -372,9 +372,7 @@ void Messenger::QueueOutboundCall(const shared_ptr<OutboundCall>& call) {
 }
 
 void Messenger::QueueInboundCall(gscoped_ptr<InboundCall> call) {
-  shared_lock<rw_spinlock> guard(lock_.get_lock());
-  scoped_refptr<RpcService>* service = FindOrNull(rpc_services_,
-                                                  call->remote_method().service_name());
+  const auto service = rpc_service(call->remote_method().service_name());
   if (PREDICT_FALSE(!service)) {
     Status s = Status::ServiceUnavailable(Substitute("service $0 not registered on $1",
                                                      call->remote_method().service_name(), name_));
{noformat}

I'll fix this.
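To make the failure mode concrete, here is a small hypothetical sketch (not 
Kudu's code) of why taking a temporary reference on the reactor thread is 
risky: if shutdown concurrently drops the registry's reference, the reactor's 
copy becomes the last one, and the blocking destructor runs on a thread where 
waiting is forbidden.

{code}
// Hypothetical sketch of the last-reference hazard described above; names are
// illustrative, not Kudu's actual classes.
#include <memory>
#include <mutex>
#include <string>
#include <thread>
#include <unordered_map>

struct ServicePoolLike {
  std::thread worker;
  ~ServicePoolLike() {
    // Blocking join: must never run on a thread where waiting is forbidden,
    // such as an RPC reactor thread.
    if (worker.joinable()) worker.join();
  }
};

std::mutex registry_lock;
std::unordered_map<std::string, std::shared_ptr<ServicePoolLike>> registry;

void QueueInboundCallLike(const std::string& service_name) {
  std::shared_ptr<ServicePoolLike> service;
  {
    // Short critical section instead of holding the lock for the whole call.
    std::lock_guard<std::mutex> guard(registry_lock);
    auto it = registry.find(service_name);
    if (it == registry.end()) return;
    service = it->second;  // ref count bumped here
  }
  // If shutdown erases the registry entry concurrently, `service` is now the
  // last reference; when it goes out of scope, the blocking destructor runs
  // on this (reactor) thread -- the situation the check failure reports.
}
{code}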

> Waiting not allowed when destructing a service pool
> ---
>
> Key: KUDU-2946
> URL: https://issues.apache.org/jira/browse/KUDU-2946
> Project: Kudu
>  Issue Type: Bug
>  Components: rpc
>Reporter: Andrew Wong
>Assignee: Adar Dembo
>Priority: Major
>
> I have a precommit that failed in TabletServerTest.TestStatus with the 
> following stack trace:
> {code}
> W0918 22:12:07.614951 30074 reactor.cc:670] Failed to create an outbound 
> connection to 255.255.255.255:1 because connect() failed: Network error: 
> connect(2) error: Network is unreachable (error 101)
> W0918 22:12:07.615072 30137 heartbeater.cc:357] Failed 3 heartbeats in a row: 
> no longer allowing fast heartbeat attempts.
> I0918 22:12:07.614991 30138 consensus_queue.cc:206] T 
>  P 4a511a2ea982499e8681b544635aaef9 [LEADER]: 
> Queue going to LEADER mode. State: All replicated index: 0, Majority 
> replicated index: 74, Committed index: 74, Last appended: 74.74, Last 
> appended by leader: 74, Current term: 75, Majority size: 1, State: 0, Mode: 
> LEADER, active raft config: opid_index: -1 peers { permanent_uuid: 
> "4a511a2ea982499e8681b544635aaef9" member_type: VOTER last_known_addr { host: 
> "127.0.0.1" port: 37531 } }
> I0918 22:12:07.615285 30141 maintenance_manager.cc:271] Maintenance manager 
> is disabled. Stopping thread.
> I0918 22:12:07.615489 24505 tablet_server.cc:152] 
> TabletServer@127.0.0.1:37531 shutting down...
> F0918 22:12:07.617386 30074 thread_restrictions.cc:79] Check failed: 
> LoadTLS()->wait_allowed Waiting is not allowed to be used on this thread to 
> prevent server-wide latency aberrations and deadlocks. Thread 30074 (name: 
> "rpc reactor", category: "reactor")
> *** Check failure stack trace: ***
> *** Aborted at 1568844727 (unix time) try "date -d @1568844727" if you are 
> using GNU date ***
> PC: @ 0x7fba67e74c37 gsignal
> *** SIGABRT (@0x3e85fb9) received by PID 24505 (TID 0x7fba5f109700) from 
> PID 24505; stack trace: ***
> I0918 22:12:07.626417 24505 ts_tablet_manager.cc:1159] Shutting down tablet 
> manager...
> I0918 22:12:07.626611 24505 tablet_replica.cc:273] T 
>  P 4a511a2ea982499e8681b544635aaef9: stopping 
> tablet replica
> I0918 22:12:07.626811 24505 raft_consensus.cc:2147] T 
>  P 4a511a2ea982499e8681b544635aaef9 [term 75 
> LEADER]: Raft consensus shutting down.
> I0918 22:12:07.626994 24505 raft_consensus.cc:2174] T 
>  P 4a511a2ea982499e8681b544635aaef9 [term 75 
> FOLLOWER]: Raft consensus is shut down!
> @ 0x7fba737ed330 (unknown) at ??:0
> @ 0x7fba67e74c37 gsignal at ??:0
> @ 0x7fba67e78028 abort at ??:0
> @ 0x7fba6b94ae09 google::logging_fail() at ??:0
> @ 0x7fba6b94c62d google::LogMessage::Fail() at ??:0
> @ 0x7fba6b94e64c google::LogMessage::SendToLog() at ??:0
> @ 0x7fba6b94c189 google::LogMessage::Flush() at ??:0
> @ 0x7fba6b94efdf google::LogMessageFatal::~LogMessageFatal() at ??:0
> @ 0x7fba6cbdb786 kudu::ThreadRestrictions::AssertWaitAllowed() at ??:0
> @   0x74059f kudu::CountDownLatch::WaitUntil() at 
> /home/jenkins-slave/workspace/kudu-master/0/src/kudu/util/countdown_latch.h:81
> @   0x70c85a kudu::CountDownLatch::WaitFor() at 
> /home/jenkins-slave/workspace/kudu-master/0/src/kudu/util/countdown_latch.h:94
> @ 0x7fba6cb9bb28 

[jira] [Assigned] (KUDU-2946) Waiting not allowed when destructing a service pool

2019-09-18 Thread Adar Dembo (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-2946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adar Dembo reassigned KUDU-2946:


Assignee: Adar Dembo

> Waiting not allowed when destructing a service pool
> ---
>
> Key: KUDU-2946
> URL: https://issues.apache.org/jira/browse/KUDU-2946
> Project: Kudu
>  Issue Type: Bug
>  Components: rpc
>Reporter: Andrew Wong
>Assignee: Adar Dembo
>Priority: Major
>
> I have a precommit that failed in TabletServerTest.TestStatus with the 
> following stack trace:
> {code}
> W0918 22:12:07.614951 30074 reactor.cc:670] Failed to create an outbound 
> connection to 255.255.255.255:1 because connect() failed: Network error: 
> connect(2) error: Network is unreachable (error 101)
> W0918 22:12:07.615072 30137 heartbeater.cc:357] Failed 3 heartbeats in a row: 
> no longer allowing fast heartbeat attempts.
> I0918 22:12:07.614991 30138 consensus_queue.cc:206] T 
>  P 4a511a2ea982499e8681b544635aaef9 [LEADER]: 
> Queue going to LEADER mode. State: All replicated index: 0, Majority 
> replicated index: 74, Committed index: 74, Last appended: 74.74, Last 
> appended by leader: 74, Current term: 75, Majority size: 1, State: 0, Mode: 
> LEADER, active raft config: opid_index: -1 peers { permanent_uuid: 
> "4a511a2ea982499e8681b544635aaef9" member_type: VOTER last_known_addr { host: 
> "127.0.0.1" port: 37531 } }
> I0918 22:12:07.615285 30141 maintenance_manager.cc:271] Maintenance manager 
> is disabled. Stopping thread.
> I0918 22:12:07.615489 24505 tablet_server.cc:152] 
> TabletServer@127.0.0.1:37531 shutting down...
> F0918 22:12:07.617386 30074 thread_restrictions.cc:79] Check failed: 
> LoadTLS()->wait_allowed Waiting is not allowed to be used on this thread to 
> prevent server-wide latency aberrations and deadlocks. Thread 30074 (name: 
> "rpc reactor", category: "reactor")
> *** Check failure stack trace: ***
> *** Aborted at 1568844727 (unix time) try "date -d @1568844727" if you are 
> using GNU date ***
> PC: @ 0x7fba67e74c37 gsignal
> *** SIGABRT (@0x3e85fb9) received by PID 24505 (TID 0x7fba5f109700) from 
> PID 24505; stack trace: ***
> I0918 22:12:07.626417 24505 ts_tablet_manager.cc:1159] Shutting down tablet 
> manager...
> I0918 22:12:07.626611 24505 tablet_replica.cc:273] T 
>  P 4a511a2ea982499e8681b544635aaef9: stopping 
> tablet replica
> I0918 22:12:07.626811 24505 raft_consensus.cc:2147] T 
>  P 4a511a2ea982499e8681b544635aaef9 [term 75 
> LEADER]: Raft consensus shutting down.
> I0918 22:12:07.626994 24505 raft_consensus.cc:2174] T 
>  P 4a511a2ea982499e8681b544635aaef9 [term 75 
> FOLLOWER]: Raft consensus is shut down!
> @ 0x7fba737ed330 (unknown) at ??:0
> @ 0x7fba67e74c37 gsignal at ??:0
> @ 0x7fba67e78028 abort at ??:0
> @ 0x7fba6b94ae09 google::logging_fail() at ??:0
> @ 0x7fba6b94c62d google::LogMessage::Fail() at ??:0
> @ 0x7fba6b94e64c google::LogMessage::SendToLog() at ??:0
> @ 0x7fba6b94c189 google::LogMessage::Flush() at ??:0
> @ 0x7fba6b94efdf google::LogMessageFatal::~LogMessageFatal() at ??:0
> @ 0x7fba6cbdb786 kudu::ThreadRestrictions::AssertWaitAllowed() at ??:0
> @   0x74059f kudu::CountDownLatch::WaitUntil() at 
> /home/jenkins-slave/workspace/kudu-master/0/src/kudu/util/countdown_latch.h:81
> @   0x70c85a kudu::CountDownLatch::WaitFor() at 
> /home/jenkins-slave/workspace/kudu-master/0/src/kudu/util/countdown_latch.h:94
> @ 0x7fba6cb9bb28 kudu::ThreadJoiner::Join() at ??:0
> @ 0x7fba6fc15cec kudu::rpc::ServicePool::Shutdown() at ??:0
> @ 0x7fba6fc14604 kudu::rpc::ServicePool::~ServicePool() at ??:0
> @ 0x7fba6fc14736 kudu::rpc::ServicePool::~ServicePool() at ??:0
> @ 0x7fba78bd8f9f scoped_refptr<>::~scoped_refptr() at ??:0
> @ 0x7fba6fb4fa56 kudu::rpc::Messenger::QueueInboundCall() at ??:0
> @ 0x7fba6fb1cb5b kudu::rpc::Connection::HandleIncomingCall() at ??:0
> @ 0x7fba6fb1b431 kudu::rpc::Connection::ReadHandler() at ??:0
> @ 0x7fba6ac9e606 ev_invoke_pending at ??:0
> @ 0x7fba6fb8593a kudu::rpc::ReactorThread::InvokePendingCb() at ??:0
> @ 0x7fba6ac9f4f8 ev_run at ??:0
> @ 0x7fba6fb85c89 kudu::rpc::ReactorThread::RunThread() at ??:0
> @ 0x7fba6fb9d503 boost::_bi::bind_t<>::operator()() at ??:0
> @ 0x7fba6fb750fc boost::function0<>::operator()() at ??:0
> @ 0x7fba6cb9e3cb kudu::Thread::SuperviseThread() at ??:0
> @ 0x7fba737e5184 start_thread at ??:0
> @ 0x7fba67f3bffd clone at ??:0
> {code}
> We appear to be waiting on the destruction of the ServicePool. Might be 
> 

[jira] [Created] (KUDU-2946) Waiting not allowed when destructing a service pool

2019-09-18 Thread Andrew Wong (Jira)
Andrew Wong created KUDU-2946:
-

 Summary: Waiting not allowed when destructing a service pool
 Key: KUDU-2946
 URL: https://issues.apache.org/jira/browse/KUDU-2946
 Project: Kudu
  Issue Type: Bug
  Components: rpc
Reporter: Andrew Wong


I have a precommit that failed in TabletServerTest.TestStatus with the 
following stack trace:

{code}
W0918 22:12:07.614951 30074 reactor.cc:670] Failed to create an outbound 
connection to 255.255.255.255:1 because connect() failed: Network error: 
connect(2) error: Network is unreachable (error 101)
W0918 22:12:07.615072 30137 heartbeater.cc:357] Failed 3 heartbeats in a row: 
no longer allowing fast heartbeat attempts.
I0918 22:12:07.614991 30138 consensus_queue.cc:206] T 
 P 4a511a2ea982499e8681b544635aaef9 [LEADER]: 
Queue going to LEADER mode. State: All replicated index: 0, Majority replicated 
index: 74, Committed index: 74, Last appended: 74.74, Last appended by leader: 
74, Current term: 75, Majority size: 1, State: 0, Mode: LEADER, active raft 
config: opid_index: -1 peers { permanent_uuid: 
"4a511a2ea982499e8681b544635aaef9" member_type: VOTER last_known_addr { host: 
"127.0.0.1" port: 37531 } }
I0918 22:12:07.615285 30141 maintenance_manager.cc:271] Maintenance manager is 
disabled. Stopping thread.
I0918 22:12:07.615489 24505 tablet_server.cc:152] TabletServer@127.0.0.1:37531 
shutting down...
F0918 22:12:07.617386 30074 thread_restrictions.cc:79] Check failed: 
LoadTLS()->wait_allowed Waiting is not allowed to be used on this thread to 
prevent server-wide latency aberrations and deadlocks. Thread 30074 (name: "rpc 
reactor", category: "reactor")
*** Check failure stack trace: ***
*** Aborted at 1568844727 (unix time) try "date -d @1568844727" if you are 
using GNU date ***
PC: @ 0x7fba67e74c37 gsignal
*** SIGABRT (@0x3e85fb9) received by PID 24505 (TID 0x7fba5f109700) from 
PID 24505; stack trace: ***
I0918 22:12:07.626417 24505 ts_tablet_manager.cc:1159] Shutting down tablet 
manager...
I0918 22:12:07.626611 24505 tablet_replica.cc:273] T 
 P 4a511a2ea982499e8681b544635aaef9: stopping 
tablet replica
I0918 22:12:07.626811 24505 raft_consensus.cc:2147] T 
 P 4a511a2ea982499e8681b544635aaef9 [term 75 
LEADER]: Raft consensus shutting down.
I0918 22:12:07.626994 24505 raft_consensus.cc:2174] T 
 P 4a511a2ea982499e8681b544635aaef9 [term 75 
FOLLOWER]: Raft consensus is shut down!
@ 0x7fba737ed330 (unknown) at ??:0
@ 0x7fba67e74c37 gsignal at ??:0
@ 0x7fba67e78028 abort at ??:0
@ 0x7fba6b94ae09 google::logging_fail() at ??:0
@ 0x7fba6b94c62d google::LogMessage::Fail() at ??:0
@ 0x7fba6b94e64c google::LogMessage::SendToLog() at ??:0
@ 0x7fba6b94c189 google::LogMessage::Flush() at ??:0
@ 0x7fba6b94efdf google::LogMessageFatal::~LogMessageFatal() at ??:0
@ 0x7fba6cbdb786 kudu::ThreadRestrictions::AssertWaitAllowed() at ??:0
@   0x74059f kudu::CountDownLatch::WaitUntil() at 
/home/jenkins-slave/workspace/kudu-master/0/src/kudu/util/countdown_latch.h:81
@   0x70c85a kudu::CountDownLatch::WaitFor() at 
/home/jenkins-slave/workspace/kudu-master/0/src/kudu/util/countdown_latch.h:94
@ 0x7fba6cb9bb28 kudu::ThreadJoiner::Join() at ??:0
@ 0x7fba6fc15cec kudu::rpc::ServicePool::Shutdown() at ??:0
@ 0x7fba6fc14604 kudu::rpc::ServicePool::~ServicePool() at ??:0
@ 0x7fba6fc14736 kudu::rpc::ServicePool::~ServicePool() at ??:0
@ 0x7fba78bd8f9f scoped_refptr<>::~scoped_refptr() at ??:0
@ 0x7fba6fb4fa56 kudu::rpc::Messenger::QueueInboundCall() at ??:0
@ 0x7fba6fb1cb5b kudu::rpc::Connection::HandleIncomingCall() at ??:0
@ 0x7fba6fb1b431 kudu::rpc::Connection::ReadHandler() at ??:0
@ 0x7fba6ac9e606 ev_invoke_pending at ??:0
@ 0x7fba6fb8593a kudu::rpc::ReactorThread::InvokePendingCb() at ??:0
@ 0x7fba6ac9f4f8 ev_run at ??:0
@ 0x7fba6fb85c89 kudu::rpc::ReactorThread::RunThread() at ??:0
@ 0x7fba6fb9d503 boost::_bi::bind_t<>::operator()() at ??:0
@ 0x7fba6fb750fc boost::function0<>::operator()() at ??:0
@ 0x7fba6cb9e3cb kudu::Thread::SuperviseThread() at ??:0
@ 0x7fba737e5184 start_thread at ??:0
@ 0x7fba67f3bffd clone at ??:0
{code}

We appear to be waiting on the destruction of the ServicePool. Might be related 
to 
https://github.com/apache/kudu/commit/0ecc2c7715505fa6d5a03f8ef967a1a96d4f55d5 
which adjusted some locking in the Messenger recently.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (KUDU-2945) ksck_remote-test: locking a mutex that has been destroyed in the Master

2019-09-18 Thread Adar Dembo (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-2945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adar Dembo resolved KUDU-2945.
--
Fix Version/s: n/a
   Resolution: Duplicate

> ksck_remote-test: locking a mutex that has been destroyed in the Master
> ---
>
> Key: KUDU-2945
> URL: https://issues.apache.org/jira/browse/KUDU-2945
> Project: Kudu
>  Issue Type: Bug
>  Components: master
>Reporter: Andrew Wong
>Assignee: Andrew Wong
>Priority: Major
> Fix For: n/a
>
> Attachments: ksck_remote-test.txt
>
>
> Seems like ProcessFullTabletReport is racing with GetTableLocations. I've 
> attached the full log of the test.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (KUDU-2842) Data race in CatalogManager::GetTableLocations

2019-09-18 Thread Adar Dembo (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-2842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adar Dembo reassigned KUDU-2842:


Assignee: (was: Will Berkeley)

> Data race in CatalogManager::GetTableLocations
> --
>
> Key: KUDU-2842
> URL: https://issues.apache.org/jira/browse/KUDU-2842
> Project: Kudu
>  Issue Type: Bug
>  Components: master
>Affects Versions: 1.10.0
>Reporter: Adar Dembo
>Priority: Blocker
> Attachments: test.txt
>
>
> Saw this on a test run.
> I think the problem is that TSInfosDict is reused for all calls to 
> BuildLocationsForTablet for a particular table, and the map inside it points 
> to peer UUIDs by reference (i.e. StringPiece) instead of by copy. Thus, when 
> a given BuildLocationsForTablet call completes, the tablet lock is released 
> and the catalog manager is free to destroy that tablet's TabletInfo (i.e. via 
> committing a mutation in ProcessTabletReport). But the very next call to 
> BuildLocationsForTablet may cause TSInfosDict map keys to be read and 
> compared, even though the memory backing those keys no longer exists.
> Assigning to Will because he committed 586e957f7 but cc'ing [~tlipcon] as he 
> was the original author of the code.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (KUDU-2945) ksck_remote-test: locking a mutex that has been destroyed in the Master

2019-09-18 Thread Andrew Wong (Jira)
Andrew Wong created KUDU-2945:
-

 Summary: ksck_remote-test: locking a mutex that has been destroyed 
in the Master
 Key: KUDU-2945
 URL: https://issues.apache.org/jira/browse/KUDU-2945
 Project: Kudu
  Issue Type: Bug
  Components: master
Reporter: Andrew Wong
Assignee: Andrew Wong
 Attachments: ksck_remote-test.txt

Seems like ProcessFullTabletReport is racing with GetTableLocations. I've 
attached the full log of the test.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (KUDU-2934) Bad merge behavior for some metrics

2019-09-18 Thread YifanZhang (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-2934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

YifanZhang reassigned KUDU-2934:


Assignee: YifanZhang

> Bad merge behavior for some metrics
> ---
>
> Key: KUDU-2934
> URL: https://issues.apache.org/jira/browse/KUDU-2934
> Project: Kudu
>  Issue Type: Bug
>  Components: metrics
>Affects Versions: 1.11.0
>Reporter: Yingchun Lai
>Assignee: YifanZhang
>Priority: Minor
>
> We added a feature to merge metrics in commit 
> fe6e5cc0c9c1573de174d1ce7838b449373ae36e ([metrics] Merge metrics by the same 
> attribute). For AtomicGauge-type metrics, we sum up the merged metrics, which 
> works for almost all metrics in Kudu.
> But I found a metric that cannot simply be merged this way, namely 
> "average_diskrowset_height", because it's an "average" value.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)