[ 
https://issues.apache.org/jira/browse/KUDU-2469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16508309#comment-16508309
 ] 

Andrew Wong edited comment on KUDU-2469 at 6/11/18 4:25 PM:
------------------------------------------------------------

The difficulty with failing a specific tablet from a CFile error is that 
CFileReader (the component that yields the checksum error) is unaware of the 
tablet to which it belongs.

Plumbing the tablet id to the CFiles seems excessive considering how many 
CFiles we might expect in a tablet server. Alternatively, we might want to 
audit the current usages of CFileReader::Init() (which is where the checksum 
currently fails) and catch these errors at the tablet layer, where the tablet 
id is known.
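
As a rough illustration of the second option, here is a minimal sketch of 
catching a corruption status at a layer that knows the tablet id. Everything 
in it (SketchStatus, SketchTablet, OpenCFile) is a hypothetical stand-in 
invented for this example, not Kudu's actual Status or Tablet classes; the 
callback stands in for a call site of CFileReader::Init().

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <utility>

// Hypothetical stand-in for a status object; not Kudu's actual Status class.
struct SketchStatus {
  enum Code { kOk, kCorruption, kIoError };
  Code code;
  std::string message;

  static SketchStatus OK() { return SketchStatus{kOk, ""}; }
  static SketchStatus Corruption(std::string msg) {
    return SketchStatus{kCorruption, std::move(msg)};
  }
  bool ok() const { return code == kOk; }
  bool IsCorruption() const { return code == kCorruption; }
};

// Hypothetical stand-in for the tablet layer, which knows its own tablet id.
class SketchTablet {
 public:
  explicit SketchTablet(std::string tablet_id)
      : tablet_id_(std::move(tablet_id)), failed_(false) {}

  // Runs a CFile-opening step (assumed to wrap a CFileReader::Init() call
  // site) and marks only this tablet as failed if it reports corruption.
  SketchStatus OpenCFile(const std::function<SketchStatus()>& init_reader) {
    assert(static_cast<bool>(init_reader));  // caller must supply a callable
    SketchStatus s = init_reader();
    if (s.IsCorruption()) {
      failed_ = true;
      // Attach the tablet id here, since the reader itself does not know it.
      s.message = "tablet " + tablet_id_ + ": " + s.message;
    }
    return s;
  }

  bool failed() const { return failed_; }

 private:
  std::string tablet_id_;
  bool failed_;
};
```

The point of the sketch is only that the reader stays unaware of the tablet; 
the caller decides what a corruption status means for the tablet it owns.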

Another approach might be to trigger the Fs::ReadableBlock's (or its 
underlying log block container's) disk error handling when returning with a 
CFile checksum error. Given what's currently in place, this would fail all 
tablets configured to stripe data across the directory in which the block 
resides, which is much coarser-grained than the behavior described in the Jira.
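
To make the granularity concern concrete, here is a small sketch of how 
failing a directory fans out to every tablet striped across it. 
SketchDirRegistry and its methods are hypothetical names made up for this 
example; they do not correspond to Kudu's actual data directory bookkeeping.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>

// Hypothetical registry of which tablets stripe data across which directory.
class SketchDirRegistry {
 public:
  void AddTabletToDir(const std::string& dir, const std::string& tablet_id) {
    assert(!tablet_id.empty());  // a tablet id is required for bookkeeping
    tablets_by_dir_[dir].insert(tablet_id);
  }

  // Returns every tablet that would be failed if `dir` were marked failed,
  // regardless of which tablet's block actually hit the checksum error.
  std::set<std::string> TabletsFailedByDir(const std::string& dir) const {
    auto it = tablets_by_dir_.find(dir);
    if (it == tablets_by_dir_.end()) {
      return std::set<std::string>();
    }
    return it->second;
  }

 private:
  std::map<std::string, std::set<std::string>> tablets_by_dir_;
};
```

With two tablets registered on one directory, a single corrupt block in that 
directory takes both of them down under this scheme, even though only one 
tablet owns the bad CFile.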


> Handle CFile checksum failures
> ------------------------------
>
>                 Key: KUDU-2469
>                 URL: https://issues.apache.org/jira/browse/KUDU-2469
>             Project: Kudu
>          Issue Type: Improvement
>          Components: cfile, tablet
>            Reporter: Andrew Wong
>            Priority: Major
>
> Today, there is no special handling for CFile checksum failures, other than 
> returning an error. It would be nice if the behavior for such a failure 
> marked the tablet as "failed": making it unavailable for reads, marking it 
> for eviction/re-replication, etc.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
