[jira] [Commented] (KUDU-2342) Insert into Lineitem table with 1340 tablets on 129 node cluster failed with "Failed to write batch "
[ https://issues.apache.org/jira/browse/KUDU-2342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16397618#comment-16397618 ]

Todd Lipcon commented on KUDU-2342:
---

{code}
if (s.ok() &&
    peer_pb &&
    peer_pb->member_type() == RaftPeerPB::NON_VOTER &&
    peer_pb->attrs().promote()) {
  // This peer is ready to promote.
  //
  // TODO(mpercy): Should we introduce a function SafeToPromote() that
  // does the same calculation as SafeToEvict() but for adding a VOTER?
  NotifyObserversOfPeerToPromote(peer->uuid());
{code}

I think Mike's TODO here is relevant. Basically we ended up proposing an uncommittable config change here.

> Insert into Lineitem table with 1340 tablets on 129 node cluster failed with "Failed to write batch "
> ------------------------------------------------------------------------------------------------------
>
>                 Key: KUDU-2342
>                 URL: https://issues.apache.org/jira/browse/KUDU-2342
>             Project: Kudu
>          Issue Type: Bug
>          Components: tablet
>    Affects Versions: 1.7.0
>            Reporter: Mostafa Mokhtar
>            Assignee: Alexey Serbin
>            Priority: Blocker
>              Labels: scalability
>         Attachments: Impala query profile.txt, tablet-info.html
>
>
> While loading TPCH 30TB on 129 node cluster via Impala, write operation failed with :
> Query Status: Kudu error(s) reported, first error: Timed out: Failed to write batch of 38590 ops to tablet b8431200388d486995a4426c88bc06a2 after 1 attempt(s): Failed to write to server: a260dca5a9c846e99cb621881a7b86b8 (vc1515.halxg.cloudera.com:7050): Write RPC to X.X.X.X:7050 timed out after 180.000s (SENT)

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[jira] [Commented] (KUDU-2342) Insert into Lineitem table with 1340 tablets on 129 node cluster failed with "Failed to write batch "
[ https://issues.apache.org/jira/browse/KUDU-2342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16397617#comment-16397617 ]

Todd Lipcon commented on KUDU-2342:
---

I think being more conservative might be good in general, e.g. after any tablet copy completes, keep including the newly-copied node in the log retention calculation for some number of seconds/minutes. More directly, though, I think it's bad to promote a node whose last communication was not successful.
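
To make the promotion concern concrete, here is a minimal sketch of what a SafeToPromote()-style check along the lines of Mike's TODO might look at. It is an illustration only, not Kudu's implementation: the `PeerState` struct, its field names, and the `earliest_retained_index` parameter are all hypothetical. The idea is simply to refuse promotion when the last exchange with the peer failed, or when the next op the peer needs has already been GCed from the leader's log.

```cpp
#include <cstdint>

// Hypothetical, simplified view of what a leader tracks about one peer.
// These names are illustrative only and are not Kudu's actual structures.
struct PeerState {
  bool is_non_voter;        // member_type is NON_VOTER
  bool promote_attr;        // attrs().promote() is set
  bool last_exchange_ok;    // the most recent exchange with the peer succeeded
  std::int64_t next_index;  // next op index the leader has to send to the peer
};

// Returns true only if the peer is marked for promotion, the last exchange
// with it succeeded, and the leader still retains the next op it needs.
bool SafeToPromote(const PeerState& peer,
                   std::int64_t earliest_retained_index) {
  if (!peer.is_non_voter || !peer.promote_attr) return false;
  if (!peer.last_exchange_ok) return false;
  // If the needed op was already GCed, the peer can never be caught up.
  return peer.next_index >= earliest_retained_index;
}
```

With the numbers reported elsewhere in this thread (the leader wants to send op 1143 to the new peer, while the earliest op left in its WAL is 1154), such a check would reject the promotion.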
[jira] [Commented] (KUDU-2342) Insert into Lineitem table with 1340 tablets on 129 node cluster failed with "Failed to write batch "
[ https://issues.apache.org/jira/browse/KUDU-2342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16397597#comment-16397597 ]

David Alves commented on KUDU-2342:
---

Seems like we should be more conservative with the first rule (the one for voters only) and also apply it to the non-voter that we intend to promote. Thoughts?
[jira] [Commented] (KUDU-2342) Insert into Lineitem table with 1340 tablets on 129 node cluster failed with "Failed to write batch "
[ https://issues.apache.org/jira/browse/KUDU-2342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16397594#comment-16397594 ]

David Alves commented on KUDU-2342:
---

From what I read of the code, there are two main GC mechanisms:
* one only for voters, which makes sure we never GC past the committed index
* one for all peers, which is more conservative in that it only GCs ops once every peer has them, but has an upper bound of 80 segments

In this case we GCed logs after the tablet copy, treating the peer as a non-voter (second rule), which meant the non-voter could no longer catch up. We then promoted it to voter anyway, pushing a change config that can never be committed.
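
The interaction of the two rules can be sketched as follows. This is a simplified model under stated assumptions, not Kudu's actual retention code: `FirstIndexToRetain`, its parameters, and the representation of segments by the index of their first op are all hypothetical. Which peers contribute to `min_peer_index` (voters only, or non-voters as well) is exactly the open question in this thread and is left to the caller.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Returns the lowest op index that must be kept; ops below it may be GCed.
// segment_first_index holds the first op index of each WAL segment, oldest
// first. A max_segments_to_retain of 0 is treated here as "no cap".
std::int64_t FirstIndexToRetain(
    std::int64_t committed_index,
    std::int64_t min_peer_index,
    const std::vector<std::int64_t>& segment_first_index,
    std::size_t max_segments_to_retain) {
  // Second rule: keep whatever the slowest tracked peer still needs...
  std::int64_t for_peers = min_peer_index;
  // ...but never hold more than max_segments_to_retain segments for peers.
  if (max_segments_to_retain > 0 &&
      segment_first_index.size() > max_segments_to_retain) {
    std::size_t oldest_kept =
        segment_first_index.size() - max_segments_to_retain;
    for_peers = std::max(for_peers, segment_first_index[oldest_kept]);
  }
  // First rule: never GC past the committed index.
  return std::min(committed_index, for_peers);
}
```

In this model, a peer that needs an index older than the capped window is stranded as soon as the cap kicks in, which matches the "more than 80 segments behind" scenario suspected here.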
[jira] [Commented] (KUDU-2342) Insert into Lineitem table with 1340 tablets on 129 node cluster failed with "Failed to write batch "
[ https://issues.apache.org/jira/browse/KUDU-2342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16397537#comment-16397537 ]

Todd Lipcon commented on KUDU-2342:
---

For reference, here's the ksck report on this tablet:

{code}
Tablet b8431200388d486995a4426c88bc06a2 of table 'impala::tpch_3_kudu.lineitem' is under-replicated: 1 replica(s) not RUNNING
  14b2404c50b540ae8957adff9a6c7548 (vd1336.halxg.cloudera.com:7050): RUNNING
  a260dca5a9c846e99cb621881a7b86b8 (vc1515.halxg.cloudera.com:7050): RUNNING [LEADER]
  e3fdd8da21a643aba21b7acdd6b17499 (va1038.halxg.cloudera.com:7050): TS unavailable
  f7376c96c6b64e7fa6a7bfc84fd0cd64 (vc1534.halxg.cloudera.com:7050): RUNNING [NONVOTER]

2 replicas' active configs differ from the master's.
  All the peers reported by the master and tablet servers are:
  A = 14b2404c50b540ae8957adff9a6c7548
  B = a260dca5a9c846e99cb621881a7b86b8
  C = e3fdd8da21a643aba21b7acdd6b17499
  D = f7376c96c6b64e7fa6a7bfc84fd0cd64

The consensus matrix is:
 Config source |        Replicas        | Current term | Config index | Committed?
---------------+------------------------+--------------+--------------+------------
 master        | A   B*  C   D~         |              |              | Yes
 A             | A   B*  C   D          | 1            | 1233         | No
 B             | A   B*  C   D          | 1            | 1233         | No
 C             | [config not available] |              |              |
 D             | A   B*  C   D~         | 1            | 1141         | Yes

Table impala::tpch_3_kudu.lineitem has 1 under-replicated tablet(s)
{code}

It would be nice if ksck could report some info on opid indexes too, but that's a separate improvement.
[jira] [Commented] (KUDU-2342) Insert into Lineitem table with 1340 tablets on 129 node cluster failed with "Failed to write batch "
[ https://issues.apache.org/jira/browse/KUDU-2342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16397522#comment-16397522 ]

Todd Lipcon commented on KUDU-2342:
---

Reconstructing the timeline a bit:
- 07:20:54.751998: peer e3fdd8 fell behind the retention and "can never be caught up"
- 07:20:54.766460: peer f7376c added as a NON_VOTER
- 07:20:55.268965: tablet copy starts to f7376c
- 07:21:34.559736: tablet copy ends
- 07:21:34.779841: logs held by the tablet copy session are GCed
- 07:21:34.790443: the new NON_VOTER peer is already unable to be caught up because the logs just got GCed (*hmm, interesting*)
- 07:21:34.790797: nevertheless, the leader issues a config change to promote f7376c to VOTER

Now we have 2/4 VOTER replicas which can never be caught up -- the original bad one, and the one we just promoted. Hence we can't make progress.

It seems there are two serious issues at play here:
- why did we not retain the logs between the tablet copy session finishing and catching up the peer? Perhaps because the non-voter isn't included in the log retention calculations and was more than 80 segments behind?
- why did we promote a non-voter that wasn't relatively up to date or in a "good" state?
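
The "2/4 voters, hence no progress" step is plain majority arithmetic, sketched below. `MajoritySize` and `CanCommit` are hypothetical helper names used for illustration, not functions from the Kudu code base.

```cpp
#include <cstddef>
#include <vector>

// Number of voters that must acknowledge an op for it to commit.
std::size_t MajoritySize(std::size_t num_voters) {
  return num_voters / 2 + 1;
}

// voter_can_catch_up[i] is true if voter i can still receive ops from the
// leader (the leader itself counts as true). Returns whether enough voters
// remain reachable to commit anything, including a config change.
bool CanCommit(const std::vector<bool>& voter_can_catch_up) {
  std::size_t ok = 0;
  for (bool can : voter_can_catch_up) {
    if (can) ++ok;
  }
  return ok >= MajoritySize(voter_can_catch_up.size());
}
```

With three voters and one unrecoverable peer, 2 of 3 is still a majority. Once the unrecoverable non-voter is counted as a fourth voter, the majority becomes 3 while only 2 voters can acknowledge, so neither writes nor the promotion itself can commit.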
> Insert into Lineitem table with 1340 tablets on 129 node cluster failed with "Failed to write batch "
> ------------------------------------------------------------------------------------------------------
>
>                 Key: KUDU-2342
>                 URL: https://issues.apache.org/jira/browse/KUDU-2342
>             Project: Kudu
>          Issue Type: Bug
>          Components: tablet
>    Affects Versions: 1.7.0
>            Reporter: Mostafa Mokhtar
>            Assignee: Alexey Serbin
>            Priority: Critical
>              Labels: scalability
>         Attachments: Impala query profile.txt, tablet-info.html
>
>
> While loading TPCH 30TB on 129 node cluster via Impala, write operation failed with :
> Query Status: Kudu error(s) reported, first error: Timed out: Failed to write batch of 38590 ops to tablet b8431200388d486995a4426c88bc06a2 after 1 attempt(s): Failed to write to server: a260dca5a9c846e99cb621881a7b86b8 (vc1515.halxg.cloudera.com:7050): Write RPC to X.X.X.X:7050 timed out after 180.000s (SENT)

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[jira] [Commented] (KUDU-2342) Insert into Lineitem table with 1340 tablets on 129 node cluster failed with "Failed to write batch "
[ https://issues.apache.org/jira/browse/KUDU-2342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16397509#comment-16397509 ]

Todd Lipcon commented on KUDU-2342:
---

The change config which is pending is:

{code}
1.1233@6229814865004195840 REPLICATE CHANGE_CONFIG_OP
id { term: 1 index: 1233 }
timestamp: 6229814865004195840
op_type: CHANGE_CONFIG_OP
change_config_record {
  tablet_id: "b8431200388d486995a4426c88bc06a2"
  old_config {
    opid_index: 1141
    OBSOLETE_local: false
    peers {
      permanent_uuid: "a260dca5a9c846e99cb621881a7b86b8"
      member_type: VOTER
      last_known_addr { host: "vc1515.halxg.cloudera.com" port: 7050 }
    }
    peers {
      permanent_uuid: "e3fdd8da21a643aba21b7acdd6b17499"
      member_type: VOTER
      last_known_addr { host: "va1038.halxg.cloudera.com" port: 7050 }
    }
    peers {
      permanent_uuid: "14b2404c50b540ae8957adff9a6c7548"
      member_type: VOTER
      last_known_addr { host: "vd1336.halxg.cloudera.com" port: 7050 }
    }
    peers {
      permanent_uuid: "f7376c96c6b64e7fa6a7bfc84fd0cd64"
      member_type: NON_VOTER
      last_known_addr { host: "vc1534.halxg.cloudera.com" port: 7050 }
      attrs { promote: true }
    }
  }
  new_config {
    opid_index: 1233
    OBSOLETE_local: false
    peers {
      permanent_uuid: "a260dca5a9c846e99cb621881a7b86b8"
      member_type: VOTER
      last_known_addr { host: "vc1515.halxg.cloudera.com" port: 7050 }
    }
    peers {
      permanent_uuid: "e3fdd8da21a643aba21b7acdd6b17499"
      member_type: VOTER
      last_known_addr { host: "va1038.halxg.cloudera.com" port: 7050 }
    }
    peers {
      permanent_uuid: "14b2404c50b540ae8957adff9a6c7548"
      member_type: VOTER
      last_known_addr { host: "vd1336.halxg.cloudera.com" port: 7050 }
    }
    peers {
      permanent_uuid: "f7376c96c6b64e7fa6a7bfc84fd0cd64"
      member_type: VOTER
      last_known_addr { host: "vc1534.halxg.cloudera.com" port: 7050 }
      attrs { promote: false }
    }
  }
}
{code}

That is to say, it has a pending promotion of peer f7376c96c6b64e7fa6a7bfc84fd0cd64 (vc1534) from NON_VOTER to VOTER.
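
One way to read such a record mechanically is to compare old_config and new_config peer by peer and report the uuids whose member_type changed from NON_VOTER to VOTER. The sketch below does this over a hand-rolled `Peer` struct; the types and the `PromotedPeers` function are hypothetical stand-ins for illustration, not Kudu's protobuf classes.

```cpp
#include <map>
#include <string>
#include <vector>

enum class MemberType { VOTER, NON_VOTER };

// Hypothetical stand-in for one "peers { ... }" entry of a Raft config.
struct Peer {
  std::string uuid;
  MemberType type;
};

// Returns the uuids that are NON_VOTER in old_config and VOTER in new_config,
// in the order they appear in new_config.
std::vector<std::string> PromotedPeers(const std::vector<Peer>& old_config,
                                       const std::vector<Peer>& new_config) {
  std::map<std::string, MemberType> old_types;
  for (const Peer& p : old_config) {
    old_types[p.uuid] = p.type;
  }
  std::vector<std::string> promoted;
  for (const Peer& p : new_config) {
    auto it = old_types.find(p.uuid);
    if (it != old_types.end() &&
        it->second == MemberType::NON_VOTER &&
        p.type == MemberType::VOTER) {
      promoted.push_back(p.uuid);
    }
  }
  return promoted;
}
```

Fed the four peers from the record above, this reports only f7376c96c6b64e7fa6a7bfc84fd0cd64.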
[jira] [Commented] (KUDU-2342) Insert into Lineitem table with 1340 tablets on 129 node cluster failed with "Failed to write batch "
[ https://issues.apache.org/jira/browse/KUDU-2342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16397505#comment-16397505 ]

Todd Lipcon commented on KUDU-2342:
---

It appears what happened is that the leader actually got 80 segments ahead of the two followers, and since our default is log_max_segments_to_retain=80, it GCed the logs anyway. Then it couldn't replicate to either follower and the tablet got stuck.

I checked the earliest WAL on that server (wal-01141) and its earliest op is 1.1154.

What's a bit odd here is that the leader watermark thinks that 1232 is both the committed index and the majority-replicated index, but it wants to send ops 1143 and 1055 to the two peers.

Also interesting is that it appears this tablet is currently in a configuration with four VOTER replicas.
[jira] [Commented] (KUDU-2342) Insert into Lineitem table with 1340 tablets on 129 node cluster failed with "Failed to write batch "
[ https://issues.apache.org/jira/browse/KUDU-2342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16397479#comment-16397479 ]

Todd Lipcon commented on KUDU-2342:
---

The server vc1515 has the following spewing in its logs:

{code}
I0313 11:56:27.615651 43703 consensus_peers.cc:230] T b8431200388d486995a4426c88bc06a2 P a260dca5a9c846e99cb621881a7b86b8 -> Peer f7376c96c6b64e7fa6a7bfc84fd0cd64 (vc1534.halxg.cloudera.com:7050): Could not obtain request from queue for peer: f7376c96c6b64e7fa6a7bfc84fd0cd64. Status: Not found: Failed to read ops 1143..1221: Segment 1130 which contained index 1143 has been GCed
I0313 11:56:27.973654 43703 consensus_peers.cc:230] T b8431200388d486995a4426c88bc06a2 P a260dca5a9c846e99cb621881a7b86b8 -> Peer e3fdd8da21a643aba21b7acdd6b17499 (va1038.halxg.cloudera.com:7050): Could not obtain request from queue for peer: e3fdd8da21a643aba21b7acdd6b17499. Status: Not found: Failed to read ops 1055..1221: Segment 1043 which contained index 1055 has been GCed
{code}

In other words, it appears to have garbage-collected the log segments necessary to catch up both of its followers. Thus it's unable to replicate and commit any writes, so the write here timed out.

Instead of letting it time out we should of course respond more rapidly saying that the tablet is unavailable, but that's a separate issue.

I guess in this case we can't recover because it won't evict a follower either, since it knows that it wouldn't be able to commit the config change.

So, how did it get into the state where it had GCed logs behind the majority_replicated watermark?

[~aserbin] said he can take a look

> Insert into Lineitem table with 1340 tablets on 129 node cluster failed with "Failed to write batch "
> ------------------------------------------------------------------------------------------------------
>
>                 Key: KUDU-2342
>                 URL: https://issues.apache.org/jira/browse/KUDU-2342
>             Project: Kudu
>          Issue Type: Bug
>          Components: tablet
>    Affects Versions: 1.7.0
>            Reporter: Mostafa Mokhtar
>            Priority: Major
>              Labels: scalability
>         Attachments: Impala query profile.txt
>
>
> While loading TPCH 30TB on 129 node cluster via Impala, write operation failed with :
> Query Status: Kudu error(s) reported, first error: Timed out: Failed to write batch of 38590 ops to tablet b8431200388d486995a4426c88bc06a2 after 1 attempt(s): Failed to write to server: a260dca5a9c846e99cb621881a7b86b8 (vc1515.halxg.cloudera.com:7050): Write RPC to X.X.X.X:7050 timed out after 180.000s (SENT)

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)