Hi Nishant!

To begin with, I'd suggest reading the HDFS user guide and becoming familiar with the architecture:
https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html
Where are the blocks stored on the datanodes? Were they on persistent (EBS) volumes on the EC2 instances, or on ephemeral instance storage? Can you log on to the datanodes and find the "blk_*" block files and their corresponding "blk_*.meta" files?

You can identify the block locations of an HDFS file with this command:

HADOOP_USER_NAME=hdfs hdfs fsck <SOME_FILE_IN_HDFS> -files -blocks -locations

If you have Kerberos turned on, you'd have to get the super-user credentials and run the command as the super-user. If there are no datanodes in the list, that means *no datanodes* have reported the block.

NOTE: On startup the Namenode doesn't know where any block is stored; it only has the mapping from each HDFS file to its blocks. The Datanodes are the ones that report a block to the Namenode, and only after those block reports (which happen on every startup) does the Namenode know where to locate the block.
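For illustration (assuming Kerberos is off, so HADOOP_USER_NAME is honoured), the checks might look roughly like this. The file path and block id are taken from your logs; /hadoop/hdfs/data is only a placeholder for whatever dfs.datanode.data.dir points to on your datanodes:

  # On a client node, as the HDFS super-user: list the blocks and where they are reported
  HADOOP_USER_NAME=hdfs hdfs fsck /test/inputdata/derby.log -files -blocks -locations

  # Check which datanodes are registered and live
  HADOOP_USER_NAME=hdfs hdfs dfsadmin -report

  # On each datanode: find the configured data directories, then look for the block files
  hdfs getconf -confKey dfs.datanode.data.dir
  find /hadoop/hdfs/data -name 'blk_1073793876*'

If fsck shows no locations for the block and the find turns up nothing on any datanode, then the block data itself is gone from disk, not just the Namenode's metadata.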
HTH,
Ravi

On Wed, Feb 15, 2017 at 11:53 PM, Nishant Verma <[email protected]> wrote:

> Hi Philippe
>
> Yes, I did. I restarted the NameNode and the other daemons multiple times.
> I found that all my files had somehow become corrupted. I was able to fix
> the issue by running the command below:
>
> hdfs fsck / | egrep -v '^\.+$' | grep -v replica | grep -v Replica
>
> But it deleted all the files from my cluster. Only the directory
> structures were left.
>
> My main concern is how this issue happened and how to prevent it from
> happening again.
>
> Regards
> Nishant
>
> Sent from handheld device. Please ignore typos.
>
> On Wed, Feb 15, 2017 at 3:01 PM, Philippe Kernévez <[email protected]> wrote:
>
>> Hi Nishant,
>>
>> Your namenode is probably unable to communicate with your datanodes. Did
>> you restart all the HDFS services?
>>
>> Regards,
>> Philippe
>>
>> On Tue, Feb 14, 2017 at 10:43 AM, Nishant Verma <[email protected]> wrote:
>>
>>> Hi
>>>
>>> I have an open source Hadoop 2.7.3 cluster (2 masters + 3 slaves)
>>> installed on AWS EC2 instances. I am using the cluster to integrate it
>>> with Kafka Connect.
>>>
>>> The cluster was set up last month and the Kafka Connect setup was
>>> completed last fortnight. Since then, we have been able to write Kafka
>>> topic records to our HDFS and do various operations on them.
>>>
>>> Since yesterday afternoon, Kafka topics are no longer getting committed
>>> to the cluster. When I try to open the older files, I get the error below.
>>> When I copy a new file to the cluster from local, it opens at first, but
>>> after some time it starts showing a similar IOException:
>>>
>>> 17/02/14 07:57:55 INFO hdfs.DFSClient: No node available for BP-1831277630-10.16.37.124-1484306078618:blk_1073793876_55013 file=/test/inputdata/derby.log
>>> 17/02/14 07:57:55 INFO hdfs.DFSClient: Could not obtain BP-1831277630-10.16.37.124-1484306078618:blk_1073793876_55013 from any node: java.io.IOException: No live nodes contain block BP-1831277630-10.16.37.124-1484306078618:blk_1073793876_55013 after checking nodes = [], ignoredNodes = null No live nodes contain current block Block locations: Dead nodes: . Will get new block locations from namenode and retry...
>>> 17/02/14 07:57:55 WARN hdfs.DFSClient: DFS chooseDataNode: got # 1 IOException, will wait for 499.3472970548959 msec.
>>> 17/02/14 07:57:55 INFO hdfs.DFSClient: No node available for BP-1831277630-10.16.37.124-1484306078618:blk_1073793876_55013 file=/test/inputdata/derby.log
>>> 17/02/14 07:57:55 INFO hdfs.DFSClient: Could not obtain BP-1831277630-10.16.37.124-1484306078618:blk_1073793876_55013 from any node: java.io.IOException: No live nodes contain block BP-1831277630-10.16.37.124-1484306078618:blk_1073793876_55013 after checking nodes = [], ignoredNodes = null No live nodes contain current block Block locations: Dead nodes: . Will get new block locations from namenode and retry...
>>> 17/02/14 07:57:55 WARN hdfs.DFSClient: DFS chooseDataNode: got # 2 IOException, will wait for 4988.873277172643 msec.
>>> 17/02/14 07:58:00 INFO hdfs.DFSClient: No node available for BP-1831277630-10.16.37.124-1484306078618:blk_1073793876_55013 file=/test/inputdata/derby.log
>>> 17/02/14 07:58:00 INFO hdfs.DFSClient: Could not obtain BP-1831277630-10.16.37.124-1484306078618:blk_1073793876_55013 from any node: java.io.IOException: No live nodes contain block BP-1831277630-10.16.37.124-1484306078618:blk_1073793876_55013 after checking nodes = [], ignoredNodes = null No live nodes contain current block Block locations: Dead nodes: . Will get new block locations from namenode and retry...
>>> 17/02/14 07:58:00 WARN hdfs.DFSClient: DFS chooseDataNode: got # 3 IOException, will wait for 8598.311122824263 msec.
>>> 17/02/14 07:58:09 WARN hdfs.DFSClient: Could not obtain block: BP-1831277630-10.16.37.124-1484306078618:blk_1073793876_55013 file=/test/inputdata/derby.log No live nodes contain current block Block locations: Dead nodes: . Throwing a BlockMissingException
>>> 17/02/14 07:58:09 WARN hdfs.DFSClient: Could not obtain block: BP-1831277630-10.16.37.124-1484306078618:blk_1073793876_55013 file=/test/inputdata/derby.log No live nodes contain current block Block locations: Dead nodes: . Throwing a BlockMissingException
>>> 17/02/14 07:58:09 WARN hdfs.DFSClient: DFS Read
>>> org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block: BP-1831277630-10.16.37.124-1484306078618:blk_1073793876_55013 file=/test/inputdata/derby.log
>>>         at org.apache.hadoop.hdfs.DFSInputStream.chooseDataNode(DFSInputStream.java:983)
>>>         at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:642)
>>>         at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:882)
>>>         at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:934)
>>>         at java.io.DataInputStream.read(DataInputStream.java:100)
>>>         at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:85)
>>>         at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:59)
>>>         at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:119)
>>>         at org.apache.hadoop.fs.shell.Display$Cat.printToStdout(Display.java:107)
>>>         at org.apache.hadoop.fs.shell.Display$Cat.processPath(Display.java:102)
>>>         at org.apache.hadoop.fs.shell.Command.processPaths(Command.java:317)
>>>         at org.apache.hadoop.fs.shell.Command.processPathArgument(Command.java:289)
>>>         at org.apache.hadoop.fs.shell.Command.processArgument(Command.java:271)
>>>         at org.apache.hadoop.fs.shell.Command.processArguments(Command.java:255)
>>>         at org.apache.hadoop.fs.shell.Command.processRawArguments(Command.java:201)
>>>         at org.apache.hadoop.fs.shell.Command.run(Command.java:165)
>>>         at org.apache.hadoop.fs.FsShell.run(FsShell.java:287)
>>>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
>>>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
>>>         at org.apache.hadoop.fs.FsShell.main(FsShell.java:340)
>>> cat: Could not obtain block: BP-1831277630-10.16.37.124-1484306078618:blk_1073793876_55013 file=/test/inputdata/derby.log
>>>
>>> When I do "hdfs fsck /", I get:
>>>
>>> Total size:    667782677 B
>>> Total dirs:    406
>>> Total files:   44485
>>> Total symlinks:                0
>>> Total blocks (validated):      43767 (avg. block size 15257 B)
>>>   ********************************
>>>   UNDER MIN REPL'D BLOCKS:      43766 (99.99772 %)
>>>   dfs.namenode.replication.min: 1
>>>   CORRUPT FILES:        43766
>>>   MISSING BLOCKS:       43766
>>>   MISSING SIZE:         667781648 B
>>>   CORRUPT BLOCKS:       43766
>>>   ********************************
>>> Minimally replicated blocks:   1 (0.0022848265 %)
>>> Over-replicated blocks:        0 (0.0 %)
>>> Under-replicated blocks:       0 (0.0 %)
>>> Mis-replicated blocks:         0 (0.0 %)
>>> Default replication factor:    3
>>> Average block replication:     6.8544796E-5
>>> Corrupt blocks:                43766
>>> Missing replicas:              0 (0.0 %)
>>> Number of data-nodes:          3
>>> Number of racks:               1
>>> FSCK ended at Tue Feb 14 07:59:10 UTC 2017 in 932 milliseconds
>>>
>>> The filesystem under path '/' is CORRUPT
>>>
>>> That means all my files somehow got corrupted.
>>>
>>> I want to recover my HDFS and fix the corrupt health status. I would also
>>> like to understand how such an issue occurred so suddenly and how to
>>> prevent it in the future.
>>>
>>> Thanks
>>>
>>> Nishant Verma
>>
>> --
>> Philippe Kernévez
>>
>> Technical Director (Switzerland),
>> [email protected]
>> +41 79 888 33 32
>>
>> Find OCTO on OCTO Talk: http://blog.octo.com
>> OCTO Technology http://www.octo.com
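P.S. Since the datanodes are EC2 instances, it's worth checking whether dfs.datanode.data.dir points at instance-store (ephemeral) volumes; their contents are lost when an instance is stopped or the underlying hardware fails, which could explain a sudden, cluster-wide loss of blocks while the Namenode's file-to-block metadata survives. A rough way to check (again, /hadoop/hdfs/data is only a placeholder for your configured data directory):

  # Where does each datanode keep its blocks?
  hdfs getconf -confKey dfs.datanode.data.dir

  # Which device/mount backs that directory?
  df -h /hadoop/hdfs/data

  # On the EC2 instance itself: list the instance's device mappings (ephemeral vs. EBS)
  curl -s http://169.254.169.254/latest/meta-data/block-device-mapping/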
