[ https://issues.apache.org/jira/browse/YARN-3797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14589070#comment-14589070 ]

Allen Wittenauer commented on YARN-3797:
----------------------------------------

This is the type of problem where one would use the node health check script. 
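
As a rough illustration of that suggestion, here is a minimal sketch of such a script in Python. It assumes the kernel messages quoted in the issue description are visible through `dmesg` to the user running the NodeManager, and it relies on the documented behaviour that the NodeManager reports a node as unhealthy when the script (configured via yarn.nodemanager.health-checker.script.path) prints a line beginning with "ERROR". The function names and the regular expression are hypothetical, chosen only for this example.

```python
#!/usr/bin/env python3
"""Sketch of a node health check script that flags disks reporting medium errors."""
import re
import subprocess

# Matches kernel lines such as:
#   end_request: critical medium error, dev sdf, sector 162334984
# (pattern is an assumption based on the log excerpt in the issue)
MEDIUM_ERROR = re.compile(r"critical medium error, dev (\w+), sector (\d+)")


def find_bad_devices(log_lines):
    """Return a dict mapping device name to the number of medium errors seen."""
    counts = {}
    for line in log_lines:
        match = MEDIUM_ERROR.search(line)
        if match:
            dev = match.group(1)
            counts[dev] = counts.get(dev, 0) + 1
    return counts


def health_report(log_lines):
    """Build the single line of output that the NodeManager will inspect."""
    counts = find_bad_devices(log_lines)
    if not counts:
        return "OK: no medium errors found"
    details = ", ".join(f"{dev} (count={n})" for dev, n in sorted(counts.items()))
    # A line starting with "ERROR" is what marks the node unhealthy.
    return f"ERROR: medium errors on {details}"


def read_kernel_log():
    """Assumption: `dmesg` exists and is readable by the NodeManager user."""
    try:
        result = subprocess.run(
            ["dmesg"], capture_output=True, text=True, timeout=30
        )
    except (OSError, subprocess.TimeoutExpired):
        # If the log cannot be read, report nothing rather than guess.
        return []
    return result.stdout.splitlines()


if __name__ == "__main__":
    print(health_report(read_kernel_log()))
```

Note that this marks the whole node unhealthy rather than removing only the failing disk, and that the kernel ring buffer keeps old messages, so a real script would probably need to limit itself to recent entries.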

> NodeManager not blacklisting the disk (shuffle) with errors
> -----------------------------------------------------------
>
>                 Key: YARN-3797
>                 URL: https://issues.apache.org/jira/browse/YARN-3797
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>            Reporter: Rajesh Balamohan
>
> In a multi-node environment, one of the disks (where map outputs are written) 
> in a node went bad. The errors are given below.
> {noformat}
> Info fld=0x9ad090a
> sd 6:0:5:0: [sdf]  Add. Sense: Unrecovered read error
> sd 6:0:5:0: [sdf] CDB: Read(10): 28 00 09 ad 09 08 00 00 08 00
> end_request: critical medium error, dev sdf, sector 162334984
> mpt2sas0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
> sd 6:0:5:0: [sdf]  Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
> sd 6:0:5:0: [sdf]  Sense Key : Medium Error [current]
> Info fld=0x9af8892
> sd 6:0:5:0: [sdf]  Add. Sense: Unrecovered read error
> sd 6:0:5:0: [sdf] CDB: Read(10): 28 00 09 af 88 90 00 00 08 00
> end_request: critical medium error, dev sdf, sector 162498704
> mpt2sas0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
> mpt2sas0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
> sd 6:0:5:0: [sdf]  Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
> sd 6:0:5:0: [sdf]  Sense Key : Medium Error [current]
> Info fld=0x9af8892
> sd 6:0:5:0: [sdf]  Add. Sense: Unrecovered read error
> sd 6:0:5:0: [sdf] CDB: Read(10): 28 00 09 af 88 90 00 00 08 00
> end_request: critical medium error, dev sdf, sector 162498704
> {noformat}
> Diskchecker would pass, because the system still allows directories to be 
> created and deleted without issues. But the data being served can be corrupt, 
> and fetchers fail during CRC verification, causing unwanted delays and retries. 
> Ideally the NodeManager should detect such errors and blacklist/remove those 
> disks from the NM.



