Hi Nishant!

To begin with, I'd suggest reading the HDFS user guide and becoming familiar with the architecture:
https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html
Where are the blocks stored on the datanodes? Were they on persistent (EBS) volumes on the EC2 instances, or on ephemeral instance storage? Can you log on to the datanodes and find the "blk_*" block files and their corresponding "blk_*.meta" files?

You can identify the block locations of an HDFS file with this command:

HADOOP_USER_NAME=hdfs hdfs fsck <SOME_FILE_IN_HDFS> -files -blocks -locations

If you have Kerberos turned on, you'd have to get the super-user credentials and run the command as the super-user. If there are no datanodes in the list, that means *no datanodes* have reported the block.

NOTE: On startup the Namenode doesn't know where any block is stored; it only has the mapping from each HDFS file to its blocks. The Datanodes are the ones that report a block to the Namenode, and only after those block reports (which happen on every startup) does the Namenode know where to locate the block.
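For illustration (assuming Kerberos is off, so HADOOP_USER_NAME is honoured), the checks might look roughly like this. The file path and block id are taken from your logs; /hadoop/hdfs/data is only a placeholder for whatever dfs.datanode.data.dir points to on your datanodes:

  # On a client node, as the HDFS super-user: list the blocks and where they are reported
  HADOOP_USER_NAME=hdfs hdfs fsck /test/inputdata/derby.log -files -blocks -locations

  # Check which datanodes are registered and live
  HADOOP_USER_NAME=hdfs hdfs dfsadmin -report

  # On each datanode: find the configured data directories, then look for the block files
  hdfs getconf -confKey dfs.datanode.data.dir
  find /hadoop/hdfs/data -name 'blk_1073793876*'

If fsck shows no locations for the block and the find turns up nothing on any datanode, then the block data itself is gone from disk, not just the Namenode's metadata.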
HTH,
Ravi

On Wed, Feb 15, 2017 at 11:53 PM, Nishant Verma <[email protected]> wrote:

> Hi Philippe
>
> Yes, I did. I restarted the NameNode and the other daemons multiple times.
> I found that all my files had somehow become corrupted. I was able to fix
> the issue by running the command below:
>
> hdfs fsck / | egrep -v '^\.+$' | grep -v replica | grep -v Replica
>
> But it deleted all the files from my cluster. Only the directory
> structures were left.
>
> My main concern is how this issue happened and how to prevent it from
> happening again.
>
> Regards
> Nishant
>
> Sent from handheld device. Please ignore typos.
>
> On Wed, Feb 15, 2017 at 3:01 PM, Philippe Kernévez <[email protected]> wrote:
>
>> Hi Nishant,
>>
>> Your namenode is probably unable to communicate with your datanodes. Did
>> you restart all the HDFS services?
>>
>> Regards,
>> Philippe
>>
>> On Tue, Feb 14, 2017 at 10:43 AM, Nishant Verma <[email protected]> wrote:
>>
>>> Hi
>>>
>>> I have an open source Hadoop 2.7.3 cluster (2 masters + 3 slaves)
>>> installed on AWS EC2 instances. I am using the cluster to integrate it
>>> with Kafka Connect.
>>>
>>> The cluster was set up last month and the Kafka Connect setup was
>>> completed last fortnight. Since then, we have been able to write Kafka
>>> topic records to our HDFS and do various operations on them.
>>>
>>> Since yesterday afternoon, Kafka topics are no longer getting committed
>>> to the cluster. When I try to open the older files, I get the error below.
>>> When I copy a new file to the cluster from local, it opens at first, but
>>> after some time it starts showing a similar IOException:
>>>
>>> 17/02/14 07:57:55 INFO hdfs.DFSClient: No node available for BP-1831277630-10.16.37.124-1484306078618:blk_1073793876_55013 file=/test/inputdata/derby.log
>>> 17/02/14 07:57:55 INFO hdfs.DFSClient: Could not obtain BP-1831277630-10.16.37.124-1484306078618:blk_1073793876_55013 from any node: java.io.IOException: No live nodes contain block BP-1831277630-10.16.37.124-1484306078618:blk_1073793876_55013 after checking nodes = [], ignoredNodes = null No live nodes contain current block Block locations: Dead nodes: . Will get new block locations from namenode and retry...
>>> 17/02/14 07:57:55 WARN hdfs.DFSClient: DFS chooseDataNode: got # 1 IOException, will wait for 499.3472970548959 msec.
>>> 17/02/14 07:57:55 INFO hdfs.DFSClient: No node available for BP-1831277630-10.16.37.124-1484306078618:blk_1073793876_55013 file=/test/inputdata/derby.log
>>> 17/02/14 07:57:55 INFO hdfs.DFSClient: Could not obtain BP-1831277630-10.16.37.124-1484306078618:blk_1073793876_55013 from any node: java.io.IOException: No live nodes contain block BP-1831277630-10.16.37.124-1484306078618:blk_1073793876_55013 after checking nodes = [], ignoredNodes = null No live nodes contain current block Block locations: Dead nodes: . Will get new block locations from namenode and retry...
>>> 17/02/14 07:57:55 WARN hdfs.DFSClient: DFS chooseDataNode: got # 2 IOException, will wait for 4988.873277172643 msec.
>>> 17/02/14 07:58:00 INFO hdfs.DFSClient: No node available for BP-1831277630-10.16.37.124-1484306078618:blk_1073793876_55013 file=/test/inputdata/derby.log
>>> 17/02/14 07:58:00 INFO hdfs.DFSClient: Could not obtain BP-1831277630-10.16.37.124-1484306078618:blk_1073793876_55013 from any node: java.io.IOException: No live nodes contain block BP-1831277630-10.16.37.124-1484306078618:blk_1073793876_55013 after checking nodes = [], ignoredNodes = null No live nodes contain current block Block locations: Dead nodes: . Will get new block locations from namenode and retry...
>>> 17/02/14 07:58:00 WARN hdfs.DFSClient: DFS chooseDataNode: got # 3 IOException, will wait for 8598.311122824263 msec.
>>> 17/02/14 07:58:09 WARN hdfs.DFSClient: Could not obtain block: BP-1831277630-10.16.37.124-1484306078618:blk_1073793876_55013 file=/test/inputdata/derby.log No live nodes contain current block Block locations: Dead nodes: . Throwing a BlockMissingException
>>> 17/02/14 07:58:09 WARN hdfs.DFSClient: Could not obtain block: BP-1831277630-10.16.37.124-1484306078618:blk_1073793876_55013 file=/test/inputdata/derby.log No live nodes contain current block Block locations: Dead nodes: . Throwing a BlockMissingException
>>> 17/02/14 07:58:09 WARN hdfs.DFSClient: DFS Read
>>> org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block: BP-1831277630-10.16.37.124-1484306078618:blk_1073793876_55013 file=/test/inputdata/derby.log
>>>         at org.apache.hadoop.hdfs.DFSInputStream.chooseDataNode(DFSInputStream.java:983)
>>>         at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:642)
>>>         at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:882)
>>>         at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:934)
>>>         at java.io.DataInputStream.read(DataInputStream.java:100)
>>>         at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:85)
>>>         at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:59)
>>>         at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:119)
>>>         at org.apache.hadoop.fs.shell.Display$Cat.printToStdout(Display.java:107)
>>>         at org.apache.hadoop.fs.shell.Display$Cat.processPath(Display.java:102)
>>>         at org.apache.hadoop.fs.shell.Command.processPaths(Command.java:317)
>>>         at org.apache.hadoop.fs.shell.Command.processPathArgument(Command.java:289)
>>>         at org.apache.hadoop.fs.shell.Command.processArgument(Command.java:271)
>>>         at org.apache.hadoop.fs.shell.Command.processArguments(Command.java:255)
>>>         at org.apache.hadoop.fs.shell.Command.processRawArguments(Command.java:201)
>>>         at org.apache.hadoop.fs.shell.Command.run(Command.java:165)
>>>         at org.apache.hadoop.fs.FsShell.run(FsShell.java:287)
>>>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
>>>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
>>>         at org.apache.hadoop.fs.FsShell.main(FsShell.java:340)
>>> cat: Could not obtain block: BP-1831277630-10.16.37.124-1484306078618:blk_1073793876_55013 file=/test/inputdata/derby.log
>>>
>>> When I do "hdfs fsck /", I get:
>>>
>>> Total size:    667782677 B
>>> Total dirs:    406
>>> Total files:   44485
>>> Total symlinks:                0
>>> Total blocks (validated):      43767 (avg. block size 15257 B)
>>>   ********************************
>>>   UNDER MIN REPL'D BLOCKS:      43766 (99.99772 %)
>>>   dfs.namenode.replication.min: 1
>>>   CORRUPT FILES:        43766
>>>   MISSING BLOCKS:       43766
>>>   MISSING SIZE:         667781648 B
>>>   CORRUPT BLOCKS:       43766
>>>   ********************************
>>> Minimally replicated blocks:   1 (0.0022848265 %)
>>> Over-replicated blocks:        0 (0.0 %)
>>> Under-replicated blocks:       0 (0.0 %)
>>> Mis-replicated blocks:         0 (0.0 %)
>>> Default replication factor:    3
>>> Average block replication:     6.8544796E-5
>>> Corrupt blocks:                43766
>>> Missing replicas:              0 (0.0 %)
>>> Number of data-nodes:          3
>>> Number of racks:               1
>>> FSCK ended at Tue Feb 14 07:59:10 UTC 2017 in 932 milliseconds
>>>
>>> The filesystem under path '/' is CORRUPT
>>>
>>> That means all my files somehow got corrupted.
>>>
>>> I want to recover my HDFS and fix the corrupt health status. I would also
>>> like to understand how such an issue occurred so suddenly and how to
>>> prevent it in the future.
>>>
>>> Thanks
>>>
>>> Nishant Verma
>>
>> --
>> Philippe Kernévez
>>
>> Technical Director (Switzerland),
>> [email protected]
>> +41 79 888 33 32
>>
>> Find OCTO on OCTO Talk: http://blog.octo.com
>> OCTO Technology http://www.octo.com
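P.S. Since the datanodes are EC2 instances, it's worth checking whether dfs.datanode.data.dir points at instance-store (ephemeral) volumes; their contents are lost when an instance is stopped or the underlying hardware fails, which could explain a sudden, cluster-wide loss of blocks while the Namenode's file-to-block metadata survives. A rough way to check (again, /hadoop/hdfs/data is only a placeholder for your configured data directory):

  # Where does each datanode keep its blocks?
  hdfs getconf -confKey dfs.datanode.data.dir

  # Which device/mount backs that directory?
  df -h /hadoop/hdfs/data

  # On the EC2 instance itself: list the instance's device mappings (ephemeral vs. EBS)
  curl -s http://169.254.169.254/latest/meta-data/block-device-mapping/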
