Hello Chathuri,

I have experienced this before. When a disk can no longer handle the write ops, the filesystem tries to save itself by remounting read-only. Quick fix: restart the server. Long-term fix: tune the HBase write and file-size parameters.
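The knobs I have in mind are the memstore flush size and maximum HFile size (hbase.hregion.memstore.flush.size and hbase.hregion.max.filesize in hbase-site.xml); that is only my guess at what matters here, and the right values depend entirely on your workload. For the quick fix, this is roughly the sequence I would use. HBASE_HOME, HADOOP_HOME and /data are placeholders for your own layout:

    # stop HBase before HDFS
    $HBASE_HOME/bin/stop-hbase.sh
    $HADOOP_HOME/sbin/stop-dfs.sh

    # only once the disk itself is healthy again: remount read-write (or reboot the node)
    mount -o remount,rw /data

    # bring HDFS back up before HBase
    $HADOOP_HOME/sbin/start-dfs.sh
    $HBASE_HOME/bin/start-hbase.sh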
I am not a sys admin, so I might be wrong, but you should manually check the state of all disks in your cluster and look at /var/log/messages to understand under what circumstances your SSDs failed (a few commands I would start with are sketched below the quoted mail).

Krishna

On Tue, Dec 27, 2016 at 8:54 PM, Chathuri Wimalasena <[email protected]> wrote:

> Hi,
>
> We have a hadoop cluster which has 3 login nodes and 10 data nodes. We are
> running hadoop 2.7.1 with HBase 0.94.23. Both Hadoop and HBase run on
> login node 2 (ln02). We are facing a terrible issue with our hadoop cluster
> recently: there are a lot of files in HDFS in a corrupt state. We are unable
> to figure out what caused this mass corruption and how to recover from it.
> HDFS has 40 TB of data and we are worried that we might have to rebuild the
> cluster from scratch due to these errors. Our cluster had some file system
> issues recently; below is the list of events that took place before that.
>
> - Nov 30 - The SSD drives on the ln02 node died, which triggered a kernel
>   panic and a reboot.
> - Dec 20 - The ln02 file system was set to read-only and both hard drives
>   on ln02 died. The sys admin removed and reinstalled the SSD drives on
>   ln02 and rebooted, and it came back up. One data node was also down on
>   the same day due to a disk failure.
> - Dec 21 - The same thing happened as on Dec 20th and ln02 was rebooted.
>   The sys admin replaced the failed SSD with another SSD. Another data node
>   was down on the same day.
>
> On Nov 30th and Dec 20th, after the sys admin rebooted the node, I was able
> to restart Hadoop and HBase without any issue and everything worked as
> expected. But on Dec 21st, when I restarted Hadoop, it automatically
> switched to safe mode, and the hadoop fsck command showed a lot of corrupt
> and missing files. The fsck output is below.
>
> ............................Status: CORRUPT
>  Total size: 46454858557036 B (Total open files size: 1340 B)
>  Total dirs: 43405
>  Total files: 122028
>  Total symlinks: 0 (Files currently being written: 10)
>  Total blocks (validated): 804832 (avg. block size 57719944 B) (Total open file blocks (not validated): 10)
>   ********************************
>   UNDER MIN REPL'D BLOCKS: 413578 (51.386875 %)
>   dfs.namenode.replication.min: 1
>   CORRUPT FILES: 18683
>   MISSING BLOCKS: 413578
>   MISSING SIZE: 26785603097998 B
>   CORRUPT BLOCKS: 413578
>   ********************************
>  Minimally replicated blocks: 391254 (48.613125 %)
>  Over-replicated blocks: 26548 (3.2985766 %)
>  Under-replicated blocks: 286 (0.035535365 %)
>  Mis-replicated blocks: 0 (0.0 %)
>  Default replication factor: 3
>  Average block replication: 1.4916517
>  Corrupt blocks: 413578
>  Missing replicas: 572 (0.023681387 %)
>  Number of data-nodes: 10
>  Number of racks: 1
> FSCK ended at Sat Dec 24 13:25:10 EST 2016 in 8378 milliseconds
>
> The filesystem under path '/' is CORRUPT
>
> The HDFS web UI shows the message below.
>
> *Safe mode is ON. The reported blocks 391254 needs additional 412774
> blocks to reach the threshold 0.9990 of total blocks 804832. The number of
> live datanodes 10 has reached the minimum number 0. Safe mode will be
> turned off automatically once the thresholds have been reached.*
>
> We have also seen some data nodes intermittently showing Input/output
> errors.
>
> Has anyone experienced such a situation before? Any ideas on how to
> recover from this are greatly appreciated.
>
> Thanks,
> Chathuri
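To expand on the "check your disks" suggestion above, these are the kinds of checks I had in mind. /dev/sda is only a placeholder device, reading /var/log/messages usually needs root, and smartctl assumes the smartmontools package is installed on the node:

    # any filesystem already remounted read-only?
    awk '$4 ~ /(^|,)ro(,|$)/ {print $1, $2}' /proc/mounts

    # kernel I/O errors and remount messages around the failure times
    grep -iE 'i/o error|read-only|remount' /var/log/messages
    dmesg | grep -iE 'i/o error|sd[a-z]'

    # SMART health summary for a suspect drive
    smartctl -H /dev/sda

Running the same checks on every data node should tell you whether the corruption lines up with more failed or failing disks rather than anything Hadoop or HBase did.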
