Grant Henke created KUDU-3275: --------------------------------- Summary: RaftConsensusITest.TestSlowFollower is Flaky Key: KUDU-3275 URL: https://issues.apache.org/jira/browse/KUDU-3275 Project: Kudu Issue Type: Bug Affects Versions: 1.14.0 Reporter: Grant Henke
I have seen RaftConsensusITest.TestSlowFollower fail quite a few times with the following: {code:java} I0411 07:15:34.772778 3404 raft_consensus.cc:1200] T a82c506675e74d62a75e53acd05e86e8 P 8110dc17824943a68d3051e0df96ae5d [term 1 FOLLOWER]: Deduplicated request from leader. Original: 1.18320->[1.18321-1.18746] Dedup: 1.18746->[]I0411 07:15:34.772778 3404 raft_consensus.cc:1200] T a82c506675e74d62a75e53acd05e86e8 P 8110dc17824943a68d3051e0df96ae5d [term 1 FOLLOWER]: Deduplicated request from leader. Original: 1.18320->[1.18321-1.18746] Dedup: 1.18746->[]W0411 07:15:34.798368 3632 log.cc:502] T a82c506675e74d62a75e53acd05e86e8 P 8110dc17824943a68d3051e0df96ae5d: Injecting 1096ms of latency in SegmentAllocator::Sync()W0411 07:15:34.834003 3206 consensus_peers.cc:480] T a82c506675e74d62a75e53acd05e86e8 P 5698d4de566c4adbae64b6f234c0561d -> Peer 8110dc17824943a68d3051e0df96ae5d (127.28.159.66:42271): Couldn't send request to peer 8110dc17824943a68d3051e0df96ae5d. Status: Timed out: UpdateConsensus RPC to 127.28.159.66:42271 timed out after 0.050s (SENT). This is attempt 1: this message will repeat every 5th retry.W0411 07:15:35.333277 3366 tablet_service.cc:2874] Error setting up scanner with request: Service unavailable: Timed out: could not wait for desired snapshot timestamp to be consistent: Tablet is lagging too much to be able to serve snapshot scan. Lagging by: 45366 ms, (max is 30000 ms):: new_scan_request { tablet_id: "a82c506675e74d62a75e53acd05e86e8" projected_columns { name: "key" type: INT32 is_key: true is_nullable: false } projected_columns { name: "int_val" type: INT32 is_key: false is_nullable: false } projected_columns { name: "string_val" type: STRING is_key: false is_nullable: true } read_mode: READ_AT_SNAPSHOT propagated_timestamp: 6627841138429693952 cache_blocks: true order_mode: ORDERED row_format_flags: 0 authz_token { token_data: "\010\210\306\312\203\006\"3\n\005slave\022*\n 4b6d5a12476d4dfcb5a4ab8a6f1190cb\020\001\030\001 \001(\001" signature: "<redacted>""\036\312x4\323\300\350\177k\357\226a<?\375\366\272\373\220\211\031p\036@9,\364j\2608q\031\331\270\355\214\373k8tT\224\3666\000RM\033UbQ\316\247\236#\364\302VW\302\026fJ\'" signing_key_seq_num: 0 } } call_seq_id: 0W0411 07:15:35.635779 3366 tablet_service.cc:2874] Error setting up scanner with request: Service unavailable: Timed out: could not wait for desired snapshot timestamp to be consistent: Tablet is lagging too much to be able to serve snapshot scan. Lagging by: 45668 ms, (max is 30000 ms):: new_scan_request { tablet_id: "a82c506675e74d62a75e53acd05e86e8" projected_columns { name: "key" type: INT32 is_key: true is_nullable: false } projected_columns { name: "int_val" type: INT32 is_key: false is_nullable: false } projected_columns { name: "string_val" type: STRING is_key: false is_nullable: true } read_mode: READ_AT_SNAPSHOT propagated_timestamp: 6627841138425311232 cache_blocks: true order_mode: ORDERED row_format_flags: 0 authz_token { token_data: "\010\210\306\312\203\006\"3\n\005slave\022*\n 4b6d5a12476d4dfcb5a4ab8a6f1190cb\020\001\030\001 \001(\001" signature: "<redacted>""\036\312x4\323\300\350\177k\357\226a<?\375\366\272\373\220\211\031p\036@9,\364j\2608q\031\331\270\355\214\373k8tT\224\3666\000RM\033UbQ\316\247\236#\364\302VW\302\026fJ\'" signing_key_seq_num: 0 } } call_seq_id: 0W0411 07:15:35.636209 3626 scanner-internal.cc:406] Time spent opening tablet: real 57.698s user 0.002s sys 0.000sF0411 07:15:35.636267 3626 test_workload.cc:255] Check failed: _s.ok() Bad status: Timed out: exceeded configured scan timeout of 60.000s: after 20 scan attempts: unable to retry before timeout: Remote error: Service unavailable: Timed out: could not wait for desired snapshot timestamp to be consistent: Timed out waiting for ts: P: 1618125277942117 usec, L: 0 to be safe (mode: NON-LEADER). Current safe time: P: 1618125277252906 usec, L: 0 Physical time difference: 0.689s*** Check failure stack trace: ****** Aborted at 1618125335 (unix time) try "date -d @1618125335" if you are using GNU date ***PC: @ 0x7fe2e72bafb7 gsignal*** SIGABRT (@0x3e80000727d) received by PID 29309 (TID 0x7fe2d979a700) from PID 29309; stack trace: *** @ 0x7fe2e97881f1 google::(anonymous namespace)::FailureSignalHandler() at ??:0 @ 0x7fe2eb20e980 (unknown) at ??:0 @ 0x7fe2e72bafb7 gsignal at ??:0 @ 0x7fe2e72bc921 abort at ??:0 @ 0x7fe2e9778439 google::logging_fail() at ??:0 @ 0x7fe2e977bebd google::LogMessage::Fail() at ??:0 @ 0x7fe2e977dace google::LogMessage::SendToLog() at ??:0 @ 0x7fe2e977ba40 google::LogMessage::Flush() at ??:0 @ 0x7fe2e977e3b9 google::LogMessageFatal::~LogMessageFatal() at ??:0 @ 0x7fe2efae81f0 kudu::TestWorkload::ReadThread() at ??:0 @ 0x7fe2eaa838df execute_native_thread_routine at ??:0 @ 0x7fe2eb2036db start_thread at ??:0 @ 0x7fe2e739d71f clone at ??:0I0411 07:15:35.762290 4692 consensus_queue.cc:778] T a82c506675e74d62a75e53acd05e86e8 P 5698d4de566c4adbae64b6f234c0561d [LEADER]: Peer 8110dc17824943a68d3051e0df96ae5d is lagging by at least 66546 ops behind the committed index [suppressed 8 similar messages]{code} -- This message was sent by Atlassian Jira (v8.3.4#803005)