The root cause is somewhat subtle: the consensus metadata doesn't always get fsynced, so a hard shutdown can leave the file empty on disk, which is the behavior you posted (the "file size is only 0 bytes" errors in your ksck output). This comment <https://issues.apache.org/jira/browse/KUDU-2195?focusedCommentId=16328129&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16328129> has all the details on why this happens. As a band-aid, you can use the `kudu remote_replica copy` tool to copy the remaining healthy tablet replica (on node105) onto the failed ones (on node104 and node106).
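To find which tablets on a given tablet server are affected, something like the following rough sketch might help. It lists zero-length files in the consensus-meta directory and prints a candidate copy command for each. The function names are made up for illustration, the directory is taken from the paths in the ksck output, and the command is assumed to take the form `kudu remote_replica copy <tablet_id> <src_address> <dst_address>`; please check `kudu remote_replica copy --help` on your version before running anything it prints.

```python
import os

# Assumption: consensus-meta directory as it appears in the ksck output.
DEFAULT_CMETA_DIR = "/var/lib/kudu/tserver/consensus-meta"


def find_empty_cmeta_files(cmeta_dir):
    """Return the sorted names (tablet IDs) of zero-length files in cmeta_dir."""
    empty = []
    for name in os.listdir(cmeta_dir):
        path = os.path.join(cmeta_dir, name)
        # A valid file has at least a header, so 0 bytes means it was
        # never durably written.
        if os.path.isfile(path) and os.path.getsize(path) == 0:
            empty.append(name)
    return sorted(empty)


def copy_command(tablet_id, src_address, dst_address):
    """Build the (assumed) CLI invocation as a string; nothing is executed."""
    return f"kudu remote_replica copy {tablet_id} {src_address} {dst_address}"


if __name__ == "__main__":
    # Hypothetical addresses: healthy replica on node105, failed one on node104.
    src, dst = "node105:7050", "node104:7050"
    if os.path.isdir(DEFAULT_CMETA_DIR):
        for tablet_id in find_empty_cmeta_files(DEFAULT_CMETA_DIR):
            print(copy_command(tablet_id, src, dst))
    else:
        print(f"{DEFAULT_CMETA_DIR} not found on this host")
```

Run it on each failed tablet server (node104 and node106 here) with `dst` set to that server's address. It only prints commands, so you can review them first.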
Todd posted a patch for this, further down in the link; unfortunately it hasn't landed on the master branch and AFAIK there isn't another fix at the moment.

On Mon, Jun 4, 2018 at 6:59 PM, [email protected] <[email protected]> wrote:

>
> hello!how do you do! i am come from china, i had a problem for the use of
> cloudera kudu!
>
> i had met this issue many times,i dont know what the exact reason for this
> issue. but every time i met this issue is this situation: when some master
> and tserver service started failure with the ntp unsync problem at the
> first start, and i restart the master and tserver when the ntp is sync,but
> i will met this issue! the flowing is the log for the command "kudu cluster
> ksck cluster1:7051,cluster2:7051,cluster3:7051":
>
> Tablet 147962d1afa0419bbda19e849ee210ee of table 'my_first_table' is
> unavailable: 2 replica(s) not RUNNING
> 052adf65aa5e465c86318732b3a9fcc2 (node104:7050): bad state
> State: FAILED
> Data state: TABLET_DATA_READY
> Last status: Incomplete: Unable to load consensus metadata for tablet
> 147962d1afa0419bbda19e849ee210ee: Could not read header for proto container
> file /var/lib/kudu/tserver/consensus-meta/147962d1afa0419bbda19e849ee210ee:
> File size not large enough to be valid: Proto container file
> /var/lib/kudu/tserver/consensus-meta/147962d1afa0419bbda19e849ee210ee: Tried
> to read 16 bytes at offset 0 but file size is only 0 bytes
> 2123398e90bc4373a7429b4caa014dc7 (node106:7050): bad state
> State: FAILED
> Data state: TABLET_DATA_READY
> Last status: Incomplete: Unable to load consensus metadata for tablet
> 147962d1afa0419bbda19e849ee210ee: Could not read header for proto container
> file /var/lib/kudu/tserver/consensus-meta/147962d1afa0419bbda19e849ee210ee:
> File size not large enough to be valid: Proto container file
> /var/lib/kudu/tserver/consensus-meta/147962d1afa0419bbda19e849ee210ee: Tried
> to read 16 bytes at offset 0 but file size is only 0 bytes
> d444b36807624acd96264eac11dd99fc (node105:7050): RUNNING [LEADER]
> Table my_first_table has 1 unavailable tablet(s)
> Table Summary
> Name | Status | Total Tablets | Healthy | Under-replicated |
> Unavailable
> ----------------+-------------+---------------+---------+------------------+-------------
> my_first_table | UNAVAILABLE | 16 | 15 | 0 | 1
> ==================
> Errors:
> ==================
> table consistency check error: Corruption: 1 out of 1 table(s) are bad
>
>
> the version i used is
> "kudu-master-1.5.0+cdh5.13.0+0-1.cdh5.13.0.p0.34.el7.x86_64"
> and "kudu-tserver-1.5.0+cdh5.13.0+0-1.cdh5.13.0.p0.34.el7.x86_64"
>
> can you tell me the reason for the issue and what can i do for this issue
> again ?
> ------------------------------
> [email protected]
>
