Hello Chathuri,

I have experienced this before. When a disk can no longer handle the write ops, the filesystem tries to save itself by remounting read-only. Quick fix: restart the server. Long-term fix: tune the HBase write and file-size parameters.
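The knobs I have in mind are the memstore flush size and maximum HFile size (hbase.hregion.memstore.flush.size and hbase.hregion.max.filesize in hbase-site.xml); that is only my guess at what matters here, and the right values depend entirely on your workload. For the quick fix, this is roughly the sequence I would use. HBASE_HOME, HADOOP_HOME and /data are placeholders for your own layout:

    # stop HBase before HDFS
    $HBASE_HOME/bin/stop-hbase.sh
    $HADOOP_HOME/sbin/stop-dfs.sh

    # only once the disk itself is healthy again: remount read-write (or reboot the node)
    mount -o remount,rw /data

    # bring HDFS back up before HBase
    $HADOOP_HOME/sbin/start-dfs.sh
    $HBASE_HOME/bin/start-hbase.sh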
I am not a sys admin, so I might be wrong, but you should manually check the state of all disks in your cluster and look at /var/log/messages to understand under what circumstances your SSDs failed (a few commands I would start with are sketched below the quoted mail).

Krishna

On Tue, Dec 27, 2016 at 8:54 PM, Chathuri Wimalasena <[email protected]> wrote:

> Hi,
>
> We have a hadoop cluster which has 3 login nodes and 10 data nodes. We are
> running hadoop 2.7.1 with HBase 0.94.23. Both Hadoop and HBase run on
> login node 2 (ln02). We are facing a terrible issue with our hadoop cluster
> recently: there are a lot of files in HDFS in a corrupt state. We are unable
> to figure out what caused this mass corruption and how to recover from it.
> HDFS has 40 TB of data and we are worried that we might have to rebuild the
> cluster from scratch due to these errors. Our cluster had some file system
> issues recently; below is the list of events that took place before that.
>
> - Nov 30 - The SSD drives on the ln02 node died, which triggered a kernel
>   panic and a reboot.
> - Dec 20 - The ln02 file system was set to read-only and both hard drives
>   on ln02 died. The sys admin removed and reinstalled the SSD drives on
>   ln02 and rebooted, and it came back up. One data node was also down on
>   the same day due to a disk failure.
> - Dec 21 - The same thing happened as on Dec 20th and ln02 was rebooted.
>   The sys admin replaced the failed SSD with another SSD. Another data node
>   was down on the same day.
>
> On Nov 30th and Dec 20th, after the sys admin rebooted the node, I was able
> to restart Hadoop and HBase without any issue and everything worked as
> expected. But on Dec 21st, when I restarted Hadoop, it automatically
> switched to safe mode, and the hadoop fsck command showed a lot of corrupt
> and missing files. The fsck output is below.
>
> ............................Status: CORRUPT
>  Total size: 46454858557036 B (Total open files size: 1340 B)
>  Total dirs: 43405
>  Total files: 122028
>  Total symlinks: 0 (Files currently being written: 10)
>  Total blocks (validated): 804832 (avg. block size 57719944 B) (Total open file blocks (not validated): 10)
>   ********************************
>   UNDER MIN REPL'D BLOCKS: 413578 (51.386875 %)
>   dfs.namenode.replication.min: 1
>   CORRUPT FILES: 18683
>   MISSING BLOCKS: 413578
>   MISSING SIZE: 26785603097998 B
>   CORRUPT BLOCKS: 413578
>   ********************************
>  Minimally replicated blocks: 391254 (48.613125 %)
>  Over-replicated blocks: 26548 (3.2985766 %)
>  Under-replicated blocks: 286 (0.035535365 %)
>  Mis-replicated blocks: 0 (0.0 %)
>  Default replication factor: 3
>  Average block replication: 1.4916517
>  Corrupt blocks: 413578
>  Missing replicas: 572 (0.023681387 %)
>  Number of data-nodes: 10
>  Number of racks: 1
> FSCK ended at Sat Dec 24 13:25:10 EST 2016 in 8378 milliseconds
>
> The filesystem under path '/' is CORRUPT
>
> The HDFS web UI shows the message below.
>
> *Safe mode is ON. The reported blocks 391254 needs additional 412774
> blocks to reach the threshold 0.9990 of total blocks 804832. The number of
> live datanodes 10 has reached the minimum number 0. Safe mode will be
> turned off automatically once the thresholds have been reached.*
>
> We have also seen some data nodes intermittently showing Input/output
> errors.
>
> Has anyone experienced such a situation before? Any ideas on how to
> recover from this are greatly appreciated.
>
> Thanks,
> Chathuri
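To expand on the "check your disks" suggestion above, these are the kinds of checks I had in mind. /dev/sda is only a placeholder device, reading /var/log/messages usually needs root, and smartctl assumes the smartmontools package is installed on the node:

    # any filesystem already remounted read-only?
    awk '$4 ~ /(^|,)ro(,|$)/ {print $1, $2}' /proc/mounts

    # kernel I/O errors and remount messages around the failure times
    grep -iE 'i/o error|read-only|remount' /var/log/messages
    dmesg | grep -iE 'i/o error|sd[a-z]'

    # SMART health summary for a suspect drive
    smartctl -H /dev/sda

Running the same checks on every data node should tell you whether the corruption lines up with more failed or failing disks rather than anything Hadoop or HBase did.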
